Applying the PageRank Algorithm to Airline Flight Data

Now that we have the Airline On-Time Performance data set loaded into parquet files on the Personal Compute Cluster, we can take a look to see what the data set tells us about the state of air transportation in the United States. The first thing I will look at is determining which are the most important airports in the United States.

A simple way to determine the most important airport is to count the number of flights it handles. However, given that most airline routes leverage a hub-and-spoke system, simply counting flights in and out does not convey how important the airport is for routing airline traffic in general. This is because certain hub airports might be critical junctures for flights into other airports, and as a result, those hub airports might be considered more important even if they have identical or even lower flight counts.

So how can we identify the most important airports in the United States considering how any given airport influences flights into another airport? If we model the flight data as a directed graph, we can use the PageRank algorithm.

I am not going to go over the details of the PageRank algorithm here, as there are a multitude of resources for that. However, I will quickly explain why PageRank is applicable here. PageRank was originally invented to measure the relative importance of web pages by evaluating the web links into each page. Any given web page is considered to be more important if other important web pages have web links into that web page. We can apply this same concept of importance to airports. If you replace “web page” with “airport” and replace “web link” with “airline flight”, you can see that PageRank can be used to evaluate the importance of an airport. PageRank ends up being a good way to measure airport importance given the airlines’ hub-and-spoke system for scheduling flights. An important airport would end up being an airport that is itself a hub that other hub airports have many flights into, or a “hub of hubs”.
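To make the idea a little more concrete, here is a minimal plain-Python sketch of PageRank computed by power iteration on a made-up four-airport network. The airport codes, the routes, and the `pagerank` helper are all hypothetical, and this is the textbook formulation in which scores sum to 1, not the GraphFrames implementation used later in this post, so the numbers are not on the same scale as the results below. The reset probability of 0.15 is just a conventional default.

```python
from collections import defaultdict

def pagerank(edges, reset_probability=0.15, max_iter=50):
    """Textbook PageRank by power iteration; scores sum to 1.

    edges: list of (src, dst) pairs; repeated pairs count as separate flights.
    Assumes every node has at least one outgoing edge (no dangling nodes).
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # every airport first receives an equal "random jump" share
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        # then each flight passes on a share of its origin's current rank
        for src, dst in edges:
            new_rank[dst] += (1 - reset_probability) * rank[src] / out_degree[src]
        rank = new_rank
    return rank

# hypothetical network: three spokes with round trips to one hub,
# plus a single extra flight from AAA to BBB
flights = [
    ("AAA", "HUB"), ("HUB", "AAA"),
    ("BBB", "HUB"), ("HUB", "BBB"),
    ("CCC", "HUB"), ("HUB", "CCC"),
    ("AAA", "BBB"),
]
ranks = pagerank(flights)
for airport, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(airport, round(score, 3))
```

In this toy network the hub ends up with the highest score, and BBB edges out AAA because it receives one extra inbound flight.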

Below is the PySpark code I used to perform PageRank analysis on the airline data. The Jupyter notebook with this code can be viewed here. Note that the airline data was originally loaded as described in this post.

Placing Airline Data into Graph

The first step in applying the PageRank algorithm is to place the data into a graph. To do this, we will be using the GraphFrames package for Apache Spark. To leverage this package, you will note that the command I use to launch the Jupyter notebook server includes a directive to load the GraphFrames package with Spark.

The first step in the analysis is to simply start the Spark session and load the airline data we previously saved to parquet files. Note that I am using the QFS distributed file system on my cluster, hence the qfs:// prefix to my file paths. This code can easily be converted to other file systems, such as HDFS. Also, for this analysis I am using the 20-year flight data set.

%matplotlib inline

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
import graphframes as GF
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt

spark = SparkSession\
        .builder\
        .appName("AirportGraphAnalysis")\
        .getOrCreate()

on_time_df = spark.read.parquet('qfs:///data/airline/processed/airline_data')
airport_codes_df = spark.read.parquet('qfs:///data/airline/processed/airport_id_table')
airline_ids_df = spark.read.parquet('qfs:///data/airline/processed/DOT_airline_codes_table')

In order to use the GraphFrames package, you need to create two data frames with very specific formats. The first is a vertices data frame with a row for each vertex in the graph. GraphFrames requires an id column; here I also carry along the name and City columns. In this analysis, each airport will be a vertex. The second data frame describes all the edges in the graph and has at least two columns, src and dst, which describe the directed edge between two vertices. In this analysis, each airline flight will be an edge in the graph, with the departure airport being the src and the arrival airport being the dst. These two data frames are then used to create a graph object using the GraphFrames package. Note that both data frames can have additional columns that add custom data to each vertex or edge. You will note below that I add the year and month to each flight (edge). This will be used later to perform the monthly analysis.

airport_vertices = (
    airport_codes_df
    .withColumnRenamed('AirportID','id')
    .withColumnRenamed('Name','name')
    .select('id','name','City')
)

airport_edges = (
    on_time_df
    .select(
        F.col('OriginAirportID').alias('src'),
        F.col('DestAirportID').alias('dst'),
        'AirlineID',
        F.format_string('%d-%02d',F.col('Year'),F.col('Month')).alias('YearMonth')
    )
    .join(
        airline_ids_df.select('AirlineID','AirlineName'),
        on='AirlineID',
        how='inner'
    )
    .drop('AirlineID')
).cache()


# create GraphFrame
graph = GF.GraphFrame(airport_vertices, airport_edges)
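As a small illustration of the two-table layout, the sketch below models vertices and edges as plain Python dictionaries and checks that every edge endpoint refers to a known vertex id. The sample rows and the `find_dangling_edges` helper are hypothetical; only the airport ids and names are borrowed from the results table further down.

```python
# Vertices need an "id"; edges need "src" and "dst". Other keys are custom data.
vertices = [
    {"id": 10397, "name": "Hartsfield-Jackson Atlanta International"},
    {"id": 13930, "name": "Chicago O'Hare International"},
]
edges = [
    {"src": 10397, "dst": 13930, "YearMonth": "2018-01"},
    {"src": 13930, "dst": 10397, "YearMonth": "2018-01"},
    {"src": 10397, "dst": 11298, "YearMonth": "2018-02"},  # 11298 is not listed above
]

def find_dangling_edges(vertices, edges):
    """Return the edges whose src or dst is missing from the vertex list."""
    known_ids = {v["id"] for v in vertices}
    return [e for e in edges if e["src"] not in known_ids or e["dst"] not in known_ids]

dangling = find_dangling_edges(vertices, edges)
print(dangling)
```

On the real data frames, a comparable check could presumably be done with a left_anti join of the edges against the vertex ids, though I have not included that here.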

Determining Airport Overall PageRank

Once the GraphFrames graph object is built, running the PageRank algorithm on it is straightforward. In the code below, it is simply the first line. I then pull out the top 20 ranked airports and display them using pandas’ elegant data frame display in Jupyter notebooks.

overall_ranks = graph.pageRank(resetProbability=0.01, maxIter=20)

(
    overall_ranks.vertices
    .orderBy("pagerank", ascending=False)
).limit(20).toPandas()

This then will result in the table displayed below.

	id	name	City	pagerank
0	10397	Hartsfield-Jackson Atlanta International	Atlanta, GA	326.211559
1	13930	Chicago O’Hare International	Chicago, IL	287.742369
2	11298	Dallas/Fort Worth International	Dallas/Fort Worth, TX	245.531416
3	12892	Los Angeles International	Los Angeles, CA	191.200115
4	11292	Denver International	Denver, CO	178.852588
5	14107	Phoenix Sky Harbor International	Phoenix, AZ	159.152823
6	12266	George Bush Intercontinental/Houston	Houston, TX	149.647836
7	12889	McCarran International	Las Vegas, NV	132.304597
8	14771	San Francisco International	San Francisco, CA	129.775437
9	11433	Detroit Metro Wayne County	Detroit, MI	124.944428
10	13487	Minneapolis-St Paul International	Minneapolis, MN	121.456924
11	11618	Newark Liberty International	Newark, NJ	113.497130
12	11057	Charlotte Douglas International	Charlotte, NC	111.874808
13	10721	Logan International	Boston, MA	104.619218
14	14869	Salt Lake City International	Salt Lake City, UT	104.344611
15	13204	Orlando International	Orlando, FL	101.302695
16	14747	Seattle/Tacoma International	Seattle, WA	99.250359
17	12953	LaGuardia	New York, NY	97.665209
18	14100	Philadelphia International	Philadelphia, PA	87.743066
19	10821	Baltimore/Washington International Thurgood Ma…	Baltimore, MD	86.864185

As you can see, if you consider all the airline flights from 1998 to 2018, Atlanta’s Hartsfield-Jackson airport ends up being the most important airport in the United States.

Determining the Monthly Trends in Airport PageRanks

Looking at the single most important airport over a 20-year span is interesting, but identifying the trend in an airport’s PageRank over time is even more interesting. To do that, we must first calculate the PageRank of each airport month-by-month over time. The code below will build a single data frame that has the by-month PageRank for each airport. It should be noted that the PageRank algorithm is iterative, and so running it against the monthly data over a 20-year span will take a while. Also, the value of a PageRank score only has meaning in the context of the data set it was calculated for. You cannot compare the PageRank score between data sets without significant analysis. An easier way to make the month-to-month PageRank scores comparable with each other is to normalize them against the highest PageRank score for that month. That is why the code below adds a normalized_rank column to the data frame.

# calculate the monthly "pageranks" of the airports. Normalize all scores so that
# the top-ranked airport has a score of 1.0, and all other scores are proportional to that.

month_set = [r.YearMonth for r in airport_edges.select('YearMonth').distinct().collect()]

monthly_ranks_list = [
    (
        graph
        .filterEdges('YearMonth = "{0}"'.format(m))
        .pageRank(resetProbability=0.01, maxIter=20)
        .vertices
        .withColumn('YearMonth',F.lit(m).cast(T.StringType()))
    ) for m in month_set
]

monthly_ranks = None

for cur_month_ranks in monthly_ranks_list:
    maxRank = cur_month_ranks.agg(F.max('pagerank').alias('maxRank')).collect()[0].maxRank
    normalized_ranks = cur_month_ranks.withColumn('normalized_rank', F.col('pagerank')/maxRank)
    if monthly_ranks is None:
        monthly_ranks = normalized_ranks
    else:
        monthly_ranks = monthly_ranks.union(normalized_ranks)
monthly_ranks = monthly_ranks.cache()
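The normalization itself is simple enough to show in plain Python. The sketch below uses made-up scores and a hypothetical `normalize_by_month` helper; it only illustrates the arithmetic behind the normalized_rank column and is not a replacement for the Spark code above.

```python
from collections import defaultdict

def normalize_by_month(rows):
    """rows: list of (year_month, airport, pagerank) tuples.

    Returns (year_month, airport, pagerank, normalized_rank) tuples, where the
    top airport of each month gets 1.0 and the rest are scaled proportionally.
    Assumes all PageRank scores are positive.
    """
    max_by_month = defaultdict(float)
    for year_month, _, score in rows:
        max_by_month[year_month] = max(max_by_month[year_month], score)
    return [
        (year_month, airport, score, score / max_by_month[year_month])
        for year_month, airport, score in rows
    ]

# made-up scores for two months; the raw values differ in scale between months
rows = [
    ("2001-01", "ORD", 4.0), ("2001-01", "ATL", 3.0),
    ("2005-01", "ORD", 6.0), ("2005-01", "ATL", 8.0),
]
normalized = normalize_by_month(rows)
for row in normalized:
    print(row)
```

After normalizing, the second-place airport scores 0.75 in both months even though the raw scores are on different scales, which is what makes the months comparable.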

Since calculating the monthly PageRank scores over 20 years takes a while, I save the results to a file so I can load them later to analyze without having to recalculate them. Note that I repartition the results, as I was trying to balance the partition file size (the larger the better) with partition count (the more the better). I simply “eyeballed” the partition count so that the PageRank data’s part files were about 1 MB in size.

#
# save the monthly airport pagerank data to disk so that analysis can be revisited later without
# having to recalculate from the beginning (which takes a while).
#

overall_ranks.vertices.repartition(3).write.parquet('qfs:///data/airline/analysis/overall_airport_pageranks', mode='overwrite')
monthly_ranks.repartition(12).write.parquet('qfs:///data/airline/analysis/monthly_airport_pageranks', mode='overwrite')
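Rather than eyeballing, the partition count could be estimated from the total output size and a target file size. This is a rough sketch with a hypothetical `estimate_partitions` helper and made-up sizes. The size of compressed parquet output is hard to predict exactly, so treat the result as a starting point.

```python
import math

def estimate_partitions(total_bytes, target_file_bytes=1024 * 1024):
    """Number of part files needed so that each is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))

# e.g. roughly 12 MB of output at about 1 MB per part file
partition_count = estimate_partitions(12 * 1024 * 1024)
print(partition_count)
```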

Finally, I can reload the data from the files just created, and create a graph showing the PageRank trends over time for the top United States airports.

# load saved calculated airport pagerank data and analyze for top N airports

top_airport_count = 15

top_airports = (
    spark.read.parquet('qfs:///data/airline/analysis/overall_airport_pageranks')
    .orderBy("pagerank", ascending=False)
    .limit(top_airport_count)
    .coalesce(1)
)

monthly_ranks_top = (
    spark.read.parquet('qfs:///data/airline/analysis/monthly_airport_pageranks')
    .join(
        F.broadcast(top_airports.select('id')),
        on='id',
        how='inner'
    )
)

airport_set = [(r.name, r.City) for r in top_airports.select('name','City').collect()]

# graph the monthly airport normalized pageranks

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

monthly_ranks_top_pd = monthly_ranks_top.orderBy('YearMonth').toPandas()

fig, ax = plt.subplots()

for airport in airport_set:
    # match on airport name rather than city, since one city can have several airports
    rank_data = monthly_ranks_top_pd[monthly_ranks_top_pd['name'] == airport[0]]
    dates_raw = rank_data['YearMonth']
    dates = [dt.datetime.strptime(date,'%Y-%m') for date in dates_raw]
    ax.plot(
        dates,
        rank_data['normalized_rank'],
        label = '{0} ({1})'.format(airport[0], airport[1]),
        linewidth = 2
    )
    
fig.set_size_inches(16,12)
plt.xlabel("Month")
plt.ylabel("Rank")
plt.setp(plt.legend(loc='lower left', prop={'size': 14}, bbox_to_anchor=(0,1.02,1,0.2)).get_lines(), linewidth=6)    
plt.show()
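One detail worth noting in the plotting code is that the zero-padded YearMonth strings sort chronologically as plain text, which is why ordering by that column before plotting works. Here is a minimal sketch with made-up values and a hypothetical `to_year_month` helper:

```python
import datetime as dt

def to_year_month(year, month):
    """Same layout as the '%d-%02d' format string used for the edge column."""
    return "%d-%02d" % (year, month)

labels = [to_year_month(2003, 11), to_year_month(1998, 2), to_year_month(2003, 3)]

# zero padding makes plain string order match chronological order
ordered = sorted(labels)

# parse back into datetimes for plotting; the day defaults to the 1st
dates = [dt.datetime.strptime(label, "%Y-%m") for label in ordered]
print(ordered)
```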

This will result in a graph that looks like this:

This graph makes it easy to see that while Atlanta’s airport may be the overall most important airport, it isn’t always in that position. Prior to 2003, Chicago’s O’Hare airport was the most important airport in the United States.

Looking at this graph, you can also quickly see some interesting structure in the data. One thing that stands out to me is the various step changes in PageRank that seem to happen across a group of airports at the same time. For example, if you follow Atlanta’s normalized PageRank, you can see that it and a number of other airports (but not all) had their scores suddenly drop around 2001, and then spike back up around 2003. I suspect this is due to one or two influential airlines making a broad change to their routes. However, I will save the analysis to discover the reason behind these changes for a future post.

Conclusion

Using the GraphFrames package, calculating the PageRank of a directed graph is very easy to do in Apache Spark. This sort of analysis can be applied to historical airline flight information to determine the most important airport in the United States.

DIY Big Data

One byte at a time …

Airline Flight Data Analysis – Airport “PageRank”
