## Identifying Bot Commenters on Reddit using Benford’s Law

How can you identify non-human actors on social media? There are many ways to do this, and each method has its strengths and weaknesses. In this post, I discuss how to use Benford’s Law to identify non-human actors in user interaction logs. Application of Benford’s Law Benford’s Law is an observation that a collection of numbers that measure naturally occurring events of items tend to have a logarithm frequency distribution for the first digit of these numbers. The are several characteristics of a naturally occurring set of numbers that Benford’s Law takes advantage of: The order of magnitude of the number in the set varies uniformly The numbers vary with multiplicative fluctuations The distribution of numbers is scale invariant The exact distribution of first digits that Benford’s Law predicts is: This results in a distribution that looks like this: For a collection of numbers, if the frequency of the numbers’ first digits does not align well with the distribution shown Read More …

## Airline Flight Data Analysis – Airport “PageRank”

Now that we have the Airline On-Time Performance data set loaded into parquet files on the Personal Compute Cluster, we can take a look to see what the data set tells us about the state of air transportation in the United States. The first thing I will look at is determining which are the most important airports in the United States. A simple way to determine the most important airport is to count number of flights that handled. However, given that most airline routes leverage a hub-and-spoke system, simply counting flights in and out does not convey how important the airport is for routing airline traffic in general. This because certain hub airports might be critical junctures for the flights in other airports, and as a result, those hub airports might be considered more important even if they have identical or even less flight counts. So how can we identify the most important airports in the United States considering how Read More …

## Airline Flight Data Analysis – Part 1 – Data Preparation (Reprise)

As some of you know, I previously explored building a Spark cluster using ODROID XU-4 single board computers. I was able to demonstrate some utility with this cluster, but it was limited. One analysis I attempted looking at was the Airline On-Time Performance data set collected by the United States Department of Transportation. The XU-4 cluster allowed me to make summarization graphs of the data, but not much more. The primary constraint was the RAM pool the cluster had, which was 8 GB total RAM across the four nodes, and the 10 year data set was greater than 30 GB uncompressed. Now that I have built the Personal Compute Cluster, I decided to revise this data set to see if I could do more sophisticated analysis. The short answer is I can. But before I do that we need to prepare the raw data that we download from the Department of Transportation’s website into a format. Specifically we need to Read More …

## Airline Flight Data Analysis – Part 2 – Analyzing On-Time Performance

In my last post on this topic, we loaded the Airline On-Time Performance data set collected by the United States Department of Transportation into a Parquet file to greatly improve the speed at which the data can be analyzed. Now, let’s take a first look at the data by graphing the average airline-caused flight delay by airline. This is a rather straightforward analysis, but is a good one to get started with the data set. Open a Jupyter python notebook on the cluster in the first cell indicate that we will be using MatPlotLib to do graphing: %matplotlib inline Then, in the next cell load data frames for the airline on time activity and airline meta data based on the parquet files built in the last post. Note that I am using QFS as my distributed file system. If you are using HDFS, simply update the file URLs as needed. air_data = spark.read.parquet(‘qfs://master:20000/user/michael/data/airline_data’) airlines = spark.read.parquet(‘qfs://master:20000/user/michael/data/airline_id_table’) Now we are ready to process Read More …