Airline Flight Data Analysis – Part 2 – Analyzing On-Time Performance

In my last post on this topic, we loaded the Airline On-Time Performance data set collected by the United States Department of Transportation into a Parquet file to greatly improve the speed at which the data can be analyzed. Now, let’s take a first look at the data by graphing the average airline-caused flight delay by airline. This is a rather straightforward analysis, but is a good one to get started with the data set. Open a Jupyter python notebook on the cluster in the first cell indicate that we will be using MatPlotLib to do graphing: %matplotlib inline Then, in the next cell load data frames for the airline on time activity and airline meta data based on the parquet files built in the last post. Note that I am using QFS as my distributed file system. If you are using HDFS, simply update the file URLs as needed. air_data = spark.read.parquet(‘qfs://master:20000/user/michael/data/airline_data’) airlines = spark.read.parquet(‘qfs://master:20000/user/michael/data/airline_id_table’) Now we are ready to process Read More …

Airline Flight Data Analysis – Part 1 – Data Preparation

This data analysis project is to explore what insights can be derived from the Airline On-Time Performance data set collected by the United States Department of Transportation. The data can be downloaded in month chunks from the Bureau of Transportation Statistics website. The data gets downloaded as a raw CSV file, which is something that Spark can easily load. However, if you download 10+ years of data from the Bureau of Transportation Statistics (meaning you downloaded 120+ one month CSV files from the site), that would collectively represent 30+ GB of data. For commercial scale Spark clusters, 30 GB of text data is a trivial task. However, if you are running Spark on the ODROID XU4 cluster or in local mode on your Mac laptop, 30+ GB of text data is substantial. So, before we can do any analysis of the dataset, we need to transform it into a format that will allow us to quickly and efficiently interact with it. Fortunately, Read More …