Data Analysis – DIY Big Data

Identifying Bot Commenters on Reddit using Benford’s Law

March 14, 2020 · October 11, 2025 michael

How can you identify non-human actors on social media? There are many ways to do this, and each method has its strengths and weaknesses. In this post, I discuss how to use Benford’s Law to identify non-human actors in user interaction logs.

Application of Benford’s Law

Benford’s Law is the observation that, in a collection of numbers measuring naturally occurring events or items, the first digits of those numbers tend to follow a logarithmic frequency distribution. There are several characteristics of a naturally occurring set of numbers that Benford’s Law takes advantage of: The exact distribution of first digits that Benford’s Law predicts is: This results in a distribution that looks like this: For a collection of numbers, if the frequency of the numbers’ first digits does not align well with the distribution shown above, then that collection of numbers is suspected not to have been generated by a natural process. You can thus use this principle to identify Read More …
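As a rough illustration of the idea in this excerpt, the sketch below computes the first-digit distribution Benford’s Law predicts, P(d) = log10(1 + 1/d) for d = 1 through 9, and scores a collection of positive integers against it. The chi-square statistic is an assumed choice of goodness-of-fit measure, not necessarily the one used in the full post, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def chi_square_vs_benford(values):
    """Chi-square statistic of observed first digits against Benford's Law.

    Assumption: values are integers; anything below 1 is ignored.
    Larger results mean a worse fit to the Benford distribution.
    """
    digits = [first_digit(v) for v in values if v >= 1]
    counts = Counter(digits)
    total = len(digits)
    statistic = 0.0
    for digit, probability in benford_expected().items():
        expected_count = total * probability
        statistic += (counts.get(digit, 0) - expected_count) ** 2 / expected_count
    return statistic


# Hypothetical data: powers of two track Benford's Law closely, while the
# integers 100..999 have uniformly distributed first digits and do not.
natural_like = [2 ** k for k in range(200)]
uniform_like = list(range(100, 1000))

print("powers of two:", chi_square_vs_benford(natural_like))
print("uniform digits:", chi_square_vs_benford(uniform_like))
```

With nine digit categories there are eight degrees of freedom, so a statistic above roughly 15.5 would reject a Benford fit at the 5% level. Treat that threshold as a starting point rather than a tuned bot detector.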

Airline Flight Data Analysis – Airport “PageRank”

December 28, 2019 · October 11, 2025 michael

Now that we have the Airline On-Time Performance data set loaded into parquet files on the Personal Compute Cluster, we can take a look to see what the data set tells us about the state of air transportation in the United States. The first thing I will look at is determining which are the most important airports in the United States. A simple way to determine the most important airport is to count the number of flights it handles. However, given that most airline routes leverage a hub-and-spoke system, simply counting flights in and out does not convey how important the airport is for routing airline traffic in general. This is because certain hub airports might be critical junctures for the flights at other airports, and as a result, those hub airports might be considered more important even if they have identical or even lower flight counts. So how can we identify the most important airports in the United States considering how Read More …
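To make the idea of ranking airports by their role in the route network concrete, here is a minimal pure-Python sketch of PageRank by power iteration. It is not the Spark code from the full post. The `pagerank` function, the damping factor of 0.85, and the tiny route list are all assumptions chosen for illustration: the airport codes are real, but the flights are made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (origin, destination) edges.

    Repeated edges act as weights, so a route flown more often passes
    along a larger share of its origin's rank.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for origin, destination in edges:
        out_links[origin].append(destination)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for origin in nodes:
            # A node with no outbound flights spreads its rank evenly.
            targets = out_links[origin] or nodes
            share = damping * rank[origin] / len(targets)
            for destination in targets:
                new_rank[destination] += share
        rank = new_rank
    return rank


# Hypothetical hub-and-spoke network: ATL is the hub, plus one direct
# spoke-to-spoke route from SAV to BHM.
flights = [
    ("ATL", "SAV"), ("SAV", "ATL"),
    ("ATL", "BHM"), ("BHM", "ATL"),
    ("ATL", "TYS"), ("TYS", "ATL"),
    ("SAV", "BHM"),
]

ranks = pagerank(flights)
for airport, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(airport, round(score, 3))
```

In this toy network the hub receives rank from every spoke, so it scores highest even though each spoke contributes only a couple of flights.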

Airline Flight Data Analysis – Part 1 – Data Preparation (Reprise)

December 1, 2019 michael

As some of you know, I previously explored building a Spark cluster using ODROID XU-4 single board computers. I was able to demonstrate some utility with this cluster, but it was limited. One analysis I attempted was of the Airline On-Time Performance data set collected by the United States Department of Transportation. The XU-4 cluster allowed me to make summarization graphs of the data, but not much more. The primary constraint was the cluster's RAM pool, which was 8 GB in total across the four nodes, while the 10-year data set was greater than 30 GB uncompressed. Now that I have built the Personal Compute Cluster, I decided to revisit this data set to see if I could do more sophisticated analysis. The short answer is I can. But before I do that, we need to convert the raw data that we download from the Department of Transportation’s website into a format better suited for analysis. Specifically we need to Read More …

Airline Flight Data Analysis – Part 2 – Analyzing On-Time Performance

September 3, 2016 michael

In my last post on this topic, we loaded the Airline On-Time Performance data set collected by the United States Department of Transportation into a Parquet file to greatly improve the speed at which the data can be analyzed. Now, let’s take a first look at the data by graphing the average airline-caused flight delay by airline. This is a rather straightforward analysis, but it is a good one to get started with the data set. Open a Jupyter Python notebook on the cluster and, in the first cell, indicate that we will be using Matplotlib to do graphing:

    %matplotlib inline

Then, in the next cell, load data frames for the airline on-time activity and airline metadata based on the parquet files built in the last post. Note that I am using QFS as my distributed file system. If you are using HDFS, simply update the file URLs as needed.

    air_data = spark.read.parquet('qfs://master:20000/user/michael/data/airline_data')
    airlines = spark.read.parquet('qfs://master:20000/user/michael/data/airline_id_table')

Now we are ready to process Read More …

Airline Flight Data Analysis – Part 1 – Data Preparation

August 28, 2016 michael

UPDATE – I have a more modern version of this post with larger data sets available here. This data analysis project is to explore what insights can be derived from the Airline On-Time Performance data set collected by the United States Department of Transportation. The data can be downloaded in monthly chunks from the Bureau of Transportation Statistics website. The data gets downloaded as a raw CSV file, which is something that Spark can easily load. However, if you download 10+ years of data from the Bureau of Transportation Statistics (meaning you downloaded 120+ one-month CSV files from the site), that would collectively represent 30+ GB of data. For commercial scale Spark clusters, processing 30 GB of text data is a trivial task. However, if you are running Spark on the ODROID XU4 cluster or in local mode on your Mac laptop, 30+ GB of text data is substantial. So, before we can do any analysis of the dataset, we need to Read More …

Converting Apache Access Logs to Parquet Backed Data Frames on Spark

August 7, 2016 michael

One of the analyses I am looking to do with my ODROID XU4 cluster is to take a look at various access patterns on my website by analyzing the Apache http access logs. Analyzing Apache access logs directly in Spark can be slow because they are unstructured text logs. Converting the logs to a data frame backed by partitioned parquet files can make subsequent analysis much faster. The first task is to create a mapper that can be used in Spark to convert a row in the access log to a Spark Row object. A Python 3 mapper would look like:

    # Parse an Apache access log. Assumes Python 3
    import re
    from pyspark.sql import Row
    from datetime import datetime

    # Raw string so the regex backslashes are not treated as string escapes
    APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "((?:[^"]|")+)" "((?:[^"]|")+)"$'
    DATETIME_PARSE_PATTERN = '%d/%b/%Y:%H:%M:%S %z'

    # Returns a Row containing the Apache Access Log info
    def parse_apache_log_line(logline):
        match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
        if match is None:
            return None

Read More …
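Because the excerpt cuts off before the mapper builds its result, the sketch below shows one way the regular expression and timestamp format quoted above could be exercised without Spark. It returns a plain dictionary instead of a Spark Row. The function name `parse_line_to_dict`, the dictionary keys, and the sample log line are assumptions for illustration and are not taken from the full post.

```python
import re
from datetime import datetime

# Same pattern as in the excerpt, compiled once and written as raw strings.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(\S+) (\S+) (\S+)" (\d{3}) (\d+) '
    r'"((?:[^"]|")+)" "((?:[^"]|")+)"$'
)
TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'


def parse_line_to_dict(logline):
    """Parse one combined-format access log line; return None if it does not match."""
    match = LOG_PATTERN.search(logline)
    if match is None:
        return None
    return {
        # Key names are hypothetical; pick whatever schema suits your analysis.
        "ip": match.group(1),
        "timestamp": datetime.strptime(match.group(4), TIME_FORMAT),
        "method": match.group(5),
        "path": match.group(6),
        "protocol": match.group(7),
        "status": int(match.group(8)),
        "bytes": int(match.group(9)),
        "referrer": match.group(10),
        "user_agent": match.group(11),
    }


# Made-up log line in Apache combined format.
sample = (
    '127.0.0.1 - - [07/Aug/2016:10:15:32 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.com/" "Mozilla/5.0"'
)

record = parse_line_to_dict(sample)
print(record["timestamp"].isoformat(), record["method"], record["path"])
```

Note that the pattern requires a numeric byte count, so lines where Apache logs the size as `-` will not match and come back as None. Returning None for unmatched lines lets a Spark job filter out malformed rows after the map step instead of failing on them.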