Now that we have the Airline On-Time Performance data set loaded into parquet files on the Personal Compute Cluster, we can take a look at what the data set tells us about the state of air transportation in the United States. The first question I will look at is which airports are the most important in the United States.
A simple way to determine the most important airport is to count the number of flights it handled. However, given that most airline routes leverage a hub-and-spoke system, simply counting flights in and out does not convey how important an airport is for routing airline traffic in general. This is because certain hub airports can be critical junctures for flights at other airports, and as a result, those hub airports might be considered more important even if they have identical or even lower flight counts.
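For comparison, the naive flight-count ranking is just a tally of arrivals and departures. Here is a plain-Python sketch over a handful of hypothetical flight records (the airports and routes are made up purely for illustration):

```python
from collections import Counter

# Hypothetical flight records as (origin, destination) airport pairs.
flights = [('ATL', 'ORD'), ('ORD', 'ATL'), ('ATL', 'DFW'), ('LAX', 'ATL')]

# Count every departure and arrival an airport handles.
handled = Counter()
for origin, dest in flights:
    handled[origin] += 1
    handled[dest] += 1

# ATL handles the most flights in this toy data set,
# but this metric says nothing about its role in routing traffic.
print(handled.most_common())
```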
So how can we identify the most important airports in the United States considering how any given airport influences flights into another airport? If we model the flight data as a directed graph, we can use the PageRank algorithm.
I am not going to go over the details of the PageRank algorithm here, as there are a multitude of resources for that. However, I will quickly explain why PageRank is applicable here. PageRank was originally invented to measure the relative importance of web pages by evaluating the web links into each page. Any given web page is considered more important if other important web pages have web links into it. We can apply this same concept of importance to airports. If you replace “web page” with “airport” and replace “web link” with “airline flight”, you can see that PageRank can be used to evaluate the importance of an airport. PageRank ends up being a good way to measure airport importance given the airlines’ hub-and-spoke system for scheduling flights. An important airport would end up being an airport that is itself a hub that other hub airports have many flights into, or a “hub of hubs”.
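To make the idea concrete, here is a minimal pure-Python power-iteration sketch of PageRank on a hypothetical three-airport graph. This is not the GraphFrames implementation we use below, just an illustration of the mechanics; the airports and the 0.85 damping factor are assumptions for the example:

```python
# Toy directed graph of flights: two spoke airports and one hub.
# Airports and routes here are hypothetical, purely for illustration.
flights = {
    'SPOKE1': ['HUB'],               # spokes only fly to the hub
    'SPOKE2': ['HUB'],
    'HUB':    ['SPOKE1', 'SPOKE2'],  # the hub flies back out to the spokes
}

damping = 0.85  # standard PageRank damping factor
ranks = {a: 1.0 / len(flights) for a in flights}

for _ in range(50):  # iterate until scores stabilize
    new_ranks = {}
    for airport in flights:
        # Sum contributions from every airport with a flight into this one;
        # each source splits its rank evenly across its outbound routes.
        inbound = sum(
            ranks[src] / len(dsts)
            for src, dsts in flights.items()
            if airport in dsts
        )
        new_ranks[airport] = (1 - damping) / len(flights) + damping * inbound
    ranks = new_ranks

# The hub collects rank from both spokes, so it scores highest.
print(ranks)
```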
Below is the PySpark code I used to perform PageRank analysis on the airline data. The Jupyter notebook with this code can be viewed here. Note that the airline data was originally loaded as described in this post.
Placing Airline Data into a Graph
The first step in applying the PageRank algorithm is to place the data into a graph. To do this, we will be using the GraphFrames package for Apache Spark. To leverage this package, you will note the command I use to launch the Jupyter notebook server includes a directive to load the GraphFrames package with Spark.
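For reference, a launch command along those lines typically looks something like the following. This is a sketch, not the exact command from the original setup, and the GraphFrames version coordinate is an assumption; pick the one matching your Spark and Scala versions:

```shell
# Hypothetical launch command: run pyspark with Jupyter as the driver,
# pulling GraphFrames in via the --packages directive.
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook \
    pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
```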
The first step in the analysis is to simply start the Spark session and load the airline data we previously saved to parquet files. Note that I am using the QFS distributed file system on my cluster, hence the `qfs://` prefix to my file paths. This code can easily be converted to other file systems, such as HDFS. Also, for this analysis I am using the 20-year flight data set.
```python
%matplotlib inline
import pyspark.sql.functions as F
import pyspark.sql.types as T
import graphframes as GF
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("AirportGraphAnalysis")\
    .getOrCreate()

on_time_df = spark.read.parquet('qfs:///data/airline/processed/airline_data')
airport_codes_df = spark.read.parquet('qfs:///data/airline/processed/airport_id_table')
airline_ids_df = spark.read.parquet('qfs:///data/airline/processed/DOT_airline_codes_table')
```
In order to use the GraphFrames package, you need to create two data frames with very specific formats. The first is a vertices data frame with at least two columns, `id` and `name`, with a row for each vertex in the graph. In this analysis, each airport will be a vertex. The second is an edges data frame describing all the edges in the graph, with at least two columns, `src` and `dst`, which identify the directed edge between any two vertices. In this analysis, each airline flight will be an edge in the graph, with the departure airport being the `src` and the arrival airport being the `dst`. These two data frames are then used to create a graph object using the GraphFrames package. Note that both data frames can have additional columns that add custom data to each vertex or edge. You will note below that I add the year and month to each flight (edge). This will be used later to perform monthly analysis.
```python
airport_vertices = (
    airport_codes_df
        .withColumnRenamed('AirportID', 'id')
        .withColumnRenamed('Name', 'name')
        .select('id', 'name', 'City')
)

airport_edges = (
    on_time_df
        .select(
            F.col('OriginAirportID').alias('src'),
            F.col('DestAirportID').alias('dst'),
            'AirlineID',
            F.format_string('%d-%02d', F.col('Year'), F.col('Month')).alias('YearMonth')
        )
        .join(
            airline_ids_df.select('AirlineID', 'AirlineName'),
            on='AirlineID',
            how='inner'
        )
        .drop('AirlineID')
).cache()

# create GraphFrame
graph = GF.GraphFrame(airport_vertices, airport_edges)
```
Determining Airport Overall PageRank
Once the GraphFrames graph object is built, running the PageRank algorithm on it is straightforward. In the code below, it is simply the first line. I then pull out the top 20 ranked airports and convert them to a Pandas data frame, which Jupyter notebooks render as an elegant table.
```python
overall_ranks = graph.pageRank(resetProbability=0.01, maxIter=20)

(
    overall_ranks.vertices
        .orderBy("pagerank", ascending=False)
).limit(20).toPandas()
```
This results in the table displayed below.
|    | id    | name                                            | City                  | pagerank   |
|----|-------|-------------------------------------------------|-----------------------|------------|
| 0  | 10397 | Hartsfield-Jackson Atlanta International        | Atlanta, GA           | 326.211559 |
| 1  | 13930 | Chicago O’Hare International                    | Chicago, IL           | 287.742369 |
| 2  | 11298 | Dallas/Fort Worth International                 | Dallas/Fort Worth, TX | 245.531416 |
| 3  | 12892 | Los Angeles International                       | Los Angeles, CA       | 191.200115 |
| 4  | 11292 | Denver International                            | Denver, CO            | 178.852588 |
| 5  | 14107 | Phoenix Sky Harbor International                | Phoenix, AZ           | 159.152823 |
| 6  | 12266 | George Bush Intercontinental/Houston            | Houston, TX           | 149.647836 |
| 7  | 12889 | McCarran International                          | Las Vegas, NV         | 132.304597 |
| 8  | 14771 | San Francisco International                     | San Francisco, CA     | 129.775437 |
| 9  | 11433 | Detroit Metro Wayne County                      | Detroit, MI           | 124.944428 |
| 10 | 13487 | Minneapolis-St Paul International               | Minneapolis, MN       | 121.456924 |
| 11 | 11618 | Newark Liberty International                    | Newark, NJ            | 113.497130 |
| 12 | 11057 | Charlotte Douglas International                 | Charlotte, NC         | 111.874808 |
| 13 | 10721 | Logan International                             | Boston, MA            | 104.619218 |
| 14 | 14869 | Salt Lake City International                    | Salt Lake City, UT    | 104.344611 |
| 15 | 13204 | Orlando International                           | Orlando, FL           | 101.302695 |
| 16 | 14747 | Seattle/Tacoma International                    | Seattle, WA           | 99.250359  |
| 17 | 12953 | LaGuardia                                       | New York, NY          | 97.665209  |
| 18 | 14100 | Philadelphia International                      | Philadelphia, PA      | 87.743066  |
| 19 | 10821 | Baltimore/Washington International Thurgood Ma… | Baltimore, MD         | 86.864185  |
As you can see, if you consider all the airline flights from 1998 to 2018, Atlanta’s Hartsfield-Jackson airport ends up being the most important airport in the United States.
Determining the Monthly Trends in Airport PageRanks
Looking at the single most important airport over a 20-year span is interesting, but identifying the trend in an airport’s PageRank over time is even more interesting. To do that, we must first calculate the PageRank of each airport month by month. The code below will build a single data frame that has the by-month PageRank for each airport. It should be noted that the PageRank algorithm is iterative, so running it against the monthly data over a 20-year span will take a while. Also, the value of a PageRank score only has meaning in the context of the data set it was calculated for. You cannot compare PageRank scores between data sets without significant analysis. An easier way to make the month-to-month PageRank scores comparable with each other is to normalize them against the highest PageRank score for that month. That is why the code below adds a `normalized_rank` column to the data frame.
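The normalization step itself is simple arithmetic. Here is a plain-Python sketch using the overall scores of the top three airports from the table above as stand-in values for one month:

```python
# Illustrative raw PageRank scores for a single month.
month_scores = {'ATL': 326.2, 'ORD': 287.7, 'DFW': 245.5}

# Normalize against the month's top score so the leader is exactly 1.0
# and every other airport's score is a fraction of it.
top = max(month_scores.values())
normalized = {airport: score / top for airport, score in month_scores.items()}

print(normalized)
```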
```python
# Calculate the monthly "pageranks" of the airports. Normalize all scores so that the
# top ranked airport has a score of 1.0, and all other scores are proportional to that.
month_set = [r.YearMonth for r in airport_edges.select('YearMonth').distinct().collect()]

monthly_ranks_list = [
    (
        graph
            .filterEdges('YearMonth = "{0}"'.format(m))
            .pageRank(resetProbability=0.01, maxIter=20)
            .vertices
            .withColumn('YearMonth', F.lit(m).cast(T.StringType()))
    )
    for m in month_set
]

monthly_ranks = None
for cur_month_ranks in monthly_ranks_list:
    maxRank = cur_month_ranks.agg(F.max('pagerank').alias('maxRank')).collect()[0].maxRank
    normalized_ranks = cur_month_ranks.withColumn('normalized_rank', F.col('pagerank')/maxRank)
    if monthly_ranks is None:
        monthly_ranks = normalized_ranks
    else:
        monthly_ranks = monthly_ranks.union(normalized_ranks)

monthly_ranks = monthly_ranks.cache()
```
Since calculating the monthly PageRank scores over 20 years takes a while, I save the results to a file so I can load them later for analysis without having to recalculate them. Note that I repartition the results, as I was trying to balance partition file size (the larger the better) with partition count (the more the better). I simply “eyeballed” the partition count so that the page rank data’s part files were about 1 MB in size.
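If you would rather compute the count than eyeball it, the rule of thumb can be sketched in plain Python. The helper and the 1 MB target are just illustrations of the trade-off described above, not part of the original notebook:

```python
def partition_count(total_bytes, target_partition_bytes=1024 * 1024):
    """Rough partition count that targets a given part-file size (1 MB by default)."""
    return max(1, round(total_bytes / target_partition_bytes))

# For roughly 12 MB of pagerank data, this suggests about 12 partitions.
print(partition_count(12 * 1024 * 1024))
```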
```python
#
# Save the monthly airport pagerank data to disk so that analysis can be revisited
# later without having to recalculate from the beginning (which takes a while).
#
overall_ranks.vertices.repartition(3).write.parquet(
    'qfs:///data/airline/analysis/overall_airport_pageranks', mode='overwrite')
monthly_ranks.repartition(12).write.parquet(
    'qfs:///data/airline/analysis/monthly_airport_pageranks', mode='overwrite')
```
Finally, I can reload the data from the files just created, and create a graph showing the PageRank trends over time for the top United States airports.
```python
# load saved calculated airport pagerank data and analyze for top N airports
top_airport_count = 15

top_airports = (
    spark.read.parquet('qfs:///data/airline/analysis/overall_airport_pageranks')
        .orderBy("pagerank", ascending=False)
        .limit(top_airport_count)
        .coalesce(1)
)

monthly_ranks_top = (
    spark.read.parquet('qfs:///data/airline/analysis/monthly_airport_pageranks')
        .join(
            F.broadcast(top_airports.select('id')),
            on='id',
            how='inner'
        )
)

airport_set = [(r.name, r.City) for r in top_airports.select('name', 'City').collect()]

# graph the monthly airport normalized pageranks
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

monthly_ranks_top_pd = monthly_ranks_top.orderBy('YearMonth').toPandas()

fig, ax = plt.subplots()
for airport in airport_set:
    rank_data = monthly_ranks_top_pd[monthly_ranks_top_pd['City'] == airport[1]]
    dates_raw = rank_data['YearMonth']
    dates = [dt.datetime.strptime(date, '%Y-%m') for date in dates_raw]
    ax.plot(
        dates,
        rank_data['normalized_rank'],
        label='{0} ({1})'.format(airport[0], airport[1]),
        linewidth=2
    )

fig.set_size_inches(16, 12)
plt.xlabel("Month")
plt.ylabel("Rank")
plt.setp(
    plt.legend(
        loc='lower left',
        prop={'size': 14},
        bbox_to_anchor=(0, 1.02, 1, 0.2)
    ).get_lines(),
    linewidth=6
)
plt.show()
```
This will result in a graph that looks like this:
This graph makes it easy to see that while Atlanta’s airport may be the most important airport overall, it hasn’t always held that position. Prior to 2003, Chicago’s O’Hare airport was the most important airport in the United States.
Looking at this graph, you can also quickly see some interesting structure in the data. One thing that stands out to me is the various step changes in PageRank that seem to happen across a group of airports at the same time. For example, if you follow Atlanta’s normalized PageRank, you can see that it and a number of other airports (but not all) had their scores suddenly drop around 2001 and then spike back up around 2003. I suspect the cause is one or two influential airlines making a broad change to their routes. However, I will save the analysis of the reason behind these changes for a future post.
Conclusion
Using the GraphFrames package, calculating the PageRank of a directed graph in Apache Spark is very easy. This sort of analysis can be applied to historical airline flight data to determine the most important airports in the United States.