Airline Flight Data Analysis – Part 1 – Data Preparation

UPDATE – I have a more modern version of this post with larger data sets available here. This data analysis project explores what insights can be derived from the Airline On-Time Performance data set collected by the United States Department of Transportation. The data can be downloaded in one-month chunks from the Bureau of Transportation Statistics website. Each chunk downloads as a raw CSV file, which Spark can easily load. However, if you download 10+ years of data from the Bureau of Transportation Statistics (meaning you downloaded 120+ one-month CSV files from the site), the files collectively represent 30+ GB of data. For a commercial-scale Spark cluster, 30 GB of text data is a trivial workload. However, if you are running Spark on the ODROID XU4 cluster or in local mode on your Mac laptop, 30+ GB of text data is substantial. So, before we can do any analysis of the dataset, we need to Read More …
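As a rough sketch of the load step this post builds toward (a minimal example only; the directory path and SparkSession setup are assumptions for illustration, not the post's actual code), Spark can read all of the monthly CSV files as a single data frame:

# A minimal sketch: read every monthly BTS CSV file in one pass.
# The directory path is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('airline-data-prep').getOrCreate()

# header=True uses the header row in each file for column names;
# inferSchema=True samples the data to assign column types.
flights = spark.read.csv('/data/airline/*.csv', header=True, inferSchema=True)

flights.printSchema()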

Adding a New Node to the ODROID XU4 Cluster

I recently acquired another ODROID XU4 device (and a MicroSD card for bulk storage) to add to my XU4 cluster. This new node brings my node count to five. Adding a new node to the cluster is relatively straightforward, but there are a lot of details. In a modern datacenter, this operation would be accomplished through a package manager, which would build the new node from an image. However, I haven’t set up package management on my cluster, so I will need to set up the new node manually. In the original cluster configuration, I used a 5-port ethernet switch for the internal network. Given that there was an open port, no additional hardware beyond the new node is needed. However, if I were to add a sixth node (or beyond), I would need to upgrade my ethernet switch to something such as an 8-port switch. I will also note that using the 40 mm PCB spacers I originally ordered makes Read More …

Connecting to HDFS from a computer external to the cluster

Since I have set up my ODROID XU4 cluster to work with Spark from a Jupyter web notebook, one of the little annoyances I have had is how inefficient it is to transfer data into the HDFS file system on the cluster. It involved downloading data to my Mac laptop via a web download, transferring that data to the master node, then running hdfs dfs -put to place the data onto the file system. While I did set up my cluster to create an NFS mount, it can be very slow when transferring large files to HDFS. So, my solution was to install HDFS on my Mac laptop and configure it to use the ODROID XU4 cluster. One of the requirements for an external client (e.g., your Mac laptop) to interact with HDFS on a cluster is that the external client must be able to connect to every node in the cluster. Since the master node of the cluster was Read More …
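As a sketch of what that client-side configuration can look like, the Hadoop installation on the laptop needs its core-site.xml to point fs.defaultFS at the cluster's namenode. The hostname and port below are placeholders for illustration, not the cluster's actual values:

<!-- core-site.xml on the external client; the hostname and port
     are placeholders for this sketch -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>

With that in place, data can be pushed to the cluster straight from the laptop, for example hdfs dfs -put ./data.csv /data/, with no intermediate copy to the master node.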

Converting Apache Access Logs to Parquet-Backed Data Frames on Spark

One of the analyses I am looking to do with my ODROID XU4 cluster is to take a look at various access patterns on my website by analyzing the Apache HTTP access logs. Analyzing Apache access logs directly in Spark can be slow because they are unstructured text logs. Converting the logs to a data frame backed by partitioned Parquet files can make subsequent analysis much faster. The first task is to create a mapper that can be used in Spark to convert a row in the access log to a Spark Row object. A Python 3 mapper would look like:

# Parse an Apache access log. Assumes Python 3
import re
from pyspark.sql import Row
from datetime import datetime

APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "((?:[^"]|")+)" "((?:[^"]|")+)"$'
DATETIME_PARSE_PATTERN = '%d/%b/%Y:%H:%M:%S %z'

# Returns a Row containing the Apache Access Log info
def parse_apache_log_line(logline):
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    if match is None:
        return None
Read More …
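The excerpt cuts off before the rest of the parser, but as a sketch of where a mapper like this leads (the paths, variable names, and partition column below are illustrative assumptions, not the post's code), the parsed rows can be collected into a data frame and written out as Parquet:

# A minimal sketch, assuming parse_apache_log_line() returns a Row,
# or None for lines that fail to parse. Paths are hypothetical.
access_logs = (spark.sparkContext
    .textFile('/data/apache/access.log')
    .map(parse_apache_log_line)
    .filter(lambda row: row is not None))

# Build a data frame from the parsed rows and persist it as Parquet
# so later queries run against structured, columnar data.
log_df = spark.createDataFrame(access_logs)
log_df.write.mode('overwrite').parquet('/data/apache/access_logs_parquet')

# If the Row carries a date field (hypothetical name), partitioning
# the write lets later queries prune whole days:
# log_df.write.partitionBy('log_date').parquet('/data/apache/access_logs_parquet')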