Airline Flight Data Analysis – Part 1 – Data Preparation
UPDATE – I have a more modern version of this post with larger data sets available here. This data analysis project is to explore what insights can be derived from the Airline On-Time Performance data set collected by the United States Department of Transportation. The data can be downloaded in month chunks from the Bureau of Transportation Statistics website. The data gets downloaded as a raw CSV file, which is something that Spark can easily load. However, if you download 10+ years of data from the Bureau of Transportation Statistics (meaning you downloaded 120+ one month CSV files from the site), that would collectively represent 30+ GB of data. For commercial scale Spark clusters, 30 GB of text data is a trivial task. However, if you are running Spark on the ODROID XU4 cluster or in local mode on your Mac laptop, 30+ GB of text data is substantial. So, before we can do any analysis of the dataset, we need to Read More …