michael – DIY Big Data

Improving Linux Kernel Network Configuration for Spark on High Performance Networks

June 27, 2020 michael

My Personal Compute Cluster recent had a failure where only of my nodes disassociated from the cluster and the 2.5 Gbps high speed ethernet link that I had set up through a USB dongle became unresponsive. Investigating the problem, I saw in the system log on that node that the kernel thought it was getting a SYN flood through the 2.5 Gbps ethernet link. Basically, the kernel turned off that networking link because it thought it was getting a DDoS attack. Clearly there wasn’t a true DDoS attack happening since my cluster is on its own network. I researched what would cause this and learned that the standard Linux kernel networking configuration is tuned for 1 Gbps ethernet links. Basically, the intense data transfer between my Spark nodes over the 2.5 Gbps ethernet links filled the kernel’s network queue. To fix the problem, I had to increase the size of the queue. To make the needed improvements to how the Read More …

Identifying Bot Commenters on Reddit using Benford’s Law

March 14, 2020October 11, 2025 michael

How can you identify non-human actors on social media? There are many ways to do this, and each method has its strengths and weaknesses. In this post, I discuss how to use Benford’s Law to identify non-human actors in user interaction logs. Application of Benford’s Law Benford’s Law is an observation that a collection of numbers that measure naturally occurring events of items tend to have a logarithm frequency distribution for the first digit of these numbers. The are several characteristics of a naturally occurring set of numbers that Benford’s Law takes advantage of: The exact distribution of first digits that Benford’s Law predicts is: This results in a distribution that looks like this: For a collection of numbers, if the frequency of the numbers’ first digits does not align well with the distribution shown above, then that collection of numbers is suspected to not have been generated by a natural process. You can this use this principal to identify Read More …

Upgrading the Compute Cluster to 2.5G Ethernet

March 7, 2020 michael

I recently updated my Personal Compute Cluster to use faster ethernet interconnect for the cluster network. After putting together a PySpark Benchmark, I was intrigued to see how faster networking within the cluster would help. Note: All product links below are affiliate links. Upgrading the Cluster Networking Hardware To upgrade the networking for the EGLOBAL S200 computers that I use within my cluster, my only real option was to use USB ethernet dongles. This is because the S200 computer has no PCIe expansion slots, but it does have a USB 3 bus. This gives me a few options. There are no 10 Gbps ethernet dongles for USB 3, but there are several 5 and 2.5 Gbps ethernet dongles. These are part fo the more recent NBASE-T ethernet standard which allows faster than 1 Gbps ethernet over Cat 5e and Cat 6 cabling. The first option I investigated was the StarTech USB 3 to 5 Gbps Ethernet Adapter. I am going Read More …

Benchmarking Software for PySpark on Apache Spark Clusters

January 24, 2020October 11, 2025 michael

Now that my Personal Compute Cluster is uninhibited by CPU overheating, I want to turn my configuration to work as efficiently as possible for the type of workloads I place on it. I searched around for Apache Spark benchmarking software, however most of what I found was either too older (circa Spark 1.x) or too arcane. I was able to get Ewan Higgs’s implementation of TeraSort working on my cluster, but it was written in Scala and not necessarily representative of the type of operations I would use in PySpark. So I set out to write my own. The primary goal of my benchmarking approach is to have a standard set of data and operations that I can compare the performance of before and after some change I make to my Spark deployment and be confident that any change in performance was due to the change in the Spark deployment and not due to variability in the benchmark. This attribute Read More …

Improving the cooling of the EGLOBAL S200 computer

January 1, 2020October 11, 2025 michael

When I set up my Personal Compute Cluster composed of EGLOBAL S200 computers (affiliate link), one of the first issues I noticed was that the CPUs would overheat when all 12 threads on the CPU (2 threads per core) were being ran at 100% load. This disappointed me about the S200 computer as it was otherwise perfect for my use case. So I set up to find a way to improve the situation. The first step is to be able to measure the CPU temperature by installing lm-sensors, which is a utility that allows you to read the CPU temperature. To install it on all machines in the cluster: Then you need to log into each machine in the cluster one by one and installation the sensor software by running: The software will try to detect the various sensors that the computer has. Accept all the default options as the software presents them to you, even if there is a Read More …

Airline Flight Data Analysis – Airport “PageRank”

December 28, 2019October 11, 2025 michael

Now that we have the Airline On-Time Performance data set loaded into parquet files on the Personal Compute Cluster, we can take a look to see what the data set tells us about the state of air transportation in the United States. The first thing I will look at is determining which are the most important airports in the United States. A simple way to determine the most important airport is to count number of flights that handled. However, given that most airline routes leverage a hub-and-spoke system, simply counting flights in and out does not convey how important the airport is for routing airline traffic in general. This because certain hub airports might be critical junctures for the flights in other airports, and as a result, those hub airports might be considered more important even if they have identical or even less flight counts. So how can we identify the most important airports in the United States considering how Read More …

Airline Flight Data Analysis – Part 1 – Data Preparation (Reprise)

December 1, 2019 michael

As some of you know, I previously explored building a Spark cluster using ODROID XU-4 single board computers. I was able to demonstrate some utility with this cluster, but it was limited. One analysis I attempted looking at was the Airline On-Time Performance data set collected by the United States Department of Transportation. The XU-4 cluster allowed me to make summarization graphs of the data, but not much more. The primary constraint was the RAM pool the cluster had, which was 8 GB total RAM across the four nodes, and the 10 year data set was greater than 30 GB uncompressed. Now that I have built the Personal Compute Cluster, I decided to revise this data set to see if I could do more sophisticated analysis. The short answer is I can. But before I do that we need to prepare the raw data that we download from the Department of Transportation’s website into a format. Specifically we need to Read More …

Running Spark with QFS on a Docker Swarm

October 27, 2019 michael

In my last post, we put together a quick build that runs Apache Spark on the Personal Compute Cluster. The goal of that build was to simply demonstrate how to run Spark on a Docker Swarm. What it did not address well was how Spark would access data. I did set up GlusterFS shared volume that did allow Spark to read and write data files. However, this was not fast because the set up did not allow Spark to take advantage of data locality when scheduling task. As a result, almost all data reads and writes are done across the ethernet backbone. The goal in this post is to set up a Spark build on Docker Swarm that is paired with a distributed file system set up in a manner where data locality can be leveraged to speed up processing. There are choices for distributed file systems, the most notable being Hadoop’s HDFS and the Quantcast File System (QFS). For Read More …

Running Spark on a Docker Swarm

September 22, 2019 michael

My project to create a Personal Compute Cluster has finally come to a point where I can create the first usable deployment of Apache Spark. The goal here is simply to get a usable deployment of Spark, but not necessarily a robust deployment. What this means is that I will set up Spark to run without HDFS or QFS as the distributed file system to hold data. Instead this set up will use the GlusterFS volume I created in my last post. The reason GlusterFS is not ideal for holding data that Spark will analyze is that Spark cannot take advantage of data locality with GlusterFS with the simple set up of GlusterFS that I did. Given my setup, Spark will see the GlusterFS volume as a local file system mount on each of the nodes. Because the GlusterFS volume presents the same files on each of the nodes, the GlusterFS volume behaves like a distributed file system. Spark just Read More …

Installing GlusterFS on the Personal Compute Cluster

September 14, 2019October 11, 2025 michael

When I set up the personal computer cluster‘s individual nodes, the SSD was set up to have three partitions. The partition mounted at /mnt/brick will be used to set up a GlusterFS volume. This GlusterFS volume will be used to persist data for services we run on the Docker Swarm that was previously set up on the cluster. The GlusterFS software needs to be installed on all of the nodes in the cluster. Here is here pssh makes it easy to set up all nodes at the same time. Now set up the GlusterFS volume from the master node and set the the master node’s firewall to enable all traffic from within the cluster’s LAN: With the last command, you should see the list of nodes all connected. node1 will be listed as localhost since you are currently on it. The next step is to create a GlusterFS volume. First a folder on the /mnt/brick partition is created on every Read More …