Personal Cluster – DIY Big Data

Improving Linux Kernel Network Configuration for Spark on High Performance Networks

June 27, 2020June 28, 2020 michael

My Personal Compute Cluster recent had a failure where only of my nodes disassociated from the cluster and the 2.5 Gbps high speed ethernet link that I had set up through a USB dongle became unresponsive. Investigating the problem, I saw in the system log on that node that the kernel thought it was getting a SYN flood through the 2.5 Gbps ethernet link. Basically, the kernel turned off that networking link because it thought it was getting a DDoS attack. Clearly there wasn’t a true DDoS attack happening since my cluster is on its own network. I researched what would cause this and learned that the standard Linux kernel networking configuration is tuned for 1 Gbps ethernet links. Basically, the intense data transfer between my Spark nodes over the 2.5 Gbps ethernet links filled the kernel’s network queue. To fix the problem, I had to increase the size of the queue. To make the needed improvements to how the Read More …

Upgrading the Compute Cluster to 2.5G Ethernet

March 7, 2020June 27, 2020 michael

I recently updated my Personal Compute Cluster to use faster ethernet interconnect for the cluster network. After putting together a PySpark Benchmark, I was intrigued to see how faster networking within the cluster would help. Note: All product links below are affiliate links. Upgrading the Cluster Networking Hardware To upgrade the networking for the EGLOBAL S200 computers that I use within my cluster, my only real option was to use USB ethernet dongles. This is because the S200 computer has no PCIe expansion slots, but it does have a USB 3 bus. This gives me a few options. There are no 10 Gbps ethernet dongles for USB 3, but there are several 5 and 2.5 Gbps ethernet dongles. These are part fo the more recent NBASE-T ethernet standard which allows faster than 1 Gbps ethernet over Cat 5e and Cat 6 cabling. The first option I investigated was the StarTech USB 3 to 5 Gbps Ethernet Adapter. I am going Read More …

Improving the cooling of the EGLOBAL S200 computer

January 1, 2020January 1, 2020 michael

When I set up my Personal Compute Cluster composed of EGLOBAL S200 computers (affiliate link), one of the first issues I noticed was that the CPUs would overheat when all 12 threads on the CPU (2 threads per core) were being ran at 100% load. This disappointed me about the S200 computer as it was otherwise perfect for my use case. So I set up to find a way to improve the situation. The first step is to be able to measure the CPU temperature by installing lm-sensors, which is a utility that allows you to read the CPU temperature. To install it on all machines in the cluster: Then you need to log into each machine in the cluster one by one and installation the sensor software by running: The software will try to detect the various sensors that the computer has. Accept all the default options as the software presents them to you, even if there is a Read More …

Airline Flight Data Analysis – Part 1 – Data Preparation (Reprise)

December 1, 2019December 1, 2019 michael

As some of you know, I previously explored building a Spark cluster using ODROID XU-4 single board computers. I was able to demonstrate some utility with this cluster, but it was limited. One analysis I attempted looking at was the Airline On-Time Performance data set collected by the United States Department of Transportation. The XU-4 cluster allowed me to make summarization graphs of the data, but not much more. The primary constraint was the RAM pool the cluster had, which was 8 GB total RAM across the four nodes, and the 10 year data set was greater than 30 GB uncompressed. Now that I have built the Personal Compute Cluster, I decided to revise this data set to see if I could do more sophisticated analysis. The short answer is I can. But before I do that we need to prepare the raw data that we download from the Department of Transportation’s website into a format. Specifically we need to Read More …

Running Spark with QFS on a Docker Swarm

October 27, 2019November 23, 2019 michael

In my last post, we put together a quick build that runs Apache Spark on the Personal Compute Cluster. The goal of that build was to simply demonstrate how to run Spark on a Docker Swarm. What it did not address well was how Spark would access data. I did set up GlusterFS shared volume that did allow Spark to read and write data files. However, this was not fast because the set up did not allow Spark to take advantage of data locality when scheduling task. As a result, almost all data reads and writes are done across the ethernet backbone. The goal in this post is to set up a Spark build on Docker Swarm that is paired with a distributed file system set up in a manner where data locality can be leveraged to speed up processing. There are choices for distributed file systems, the most notable being Hadoop’s HDFS and the Quantcast File System (QFS). For Read More …

Running Spark on a Docker Swarm

September 22, 2019November 23, 2019 michael

My project to create a Personal Compute Cluster has finally come to a point where I can create the first usable deployment of Apache Spark. The goal here is simply to get a usable deployment of Spark, but not necessarily a robust deployment. What this means is that I will set up Spark to run without HDFS or QFS as the distributed file system to hold data. Instead this set up will use the GlusterFS volume I created in my last post. The reason GlusterFS is not ideal for holding data that Spark will analyze is that Spark cannot take advantage of data locality with GlusterFS with the simple set up of GlusterFS that I did. Given my setup, Spark will see the GlusterFS volume as a local file system mount on each of the nodes. Because the GlusterFS volume presents the same files on each of the nodes, the GlusterFS volume behaves like a distributed file system. Spark just Read More …

Installing GlusterFS on the Personal Compute Cluster

September 14, 2019November 23, 2019 michael

When I set up the personal computer cluster‘s individual nodes, the SSD was set up to have three partitions. The partition mounted at /mnt/brick will be used to set up a GlusterFS volume. This GlusterFS volume will be used to persist data for services we run on the Docker Swarm that was previously set up on the cluster. The GlusterFS software needs to be installed on all of the nodes in the cluster. Here is here pssh makes it easy to set up all nodes at the same time. Now set up the GlusterFS volume from the master node and set the the master node’s firewall to enable all traffic from within the cluster’s LAN: With the last command, you should see the list of nodes all connected. node1 will be listed as localhost since you are currently on it. The next step is to create a GlusterFS volume. First a folder on the /mnt/brick partition is created on every Read More …

Setting up a Docker Swarm on the Personal Compute Cluster

September 2, 2019April 17, 2020 michael

The last time I set up a Spark cluster, I installed Spark manually and configured each node directly. For the small scale cluster I had, that was fine. This time time the cluster is still relatively small scale. However, I do want to take advantage of a prebuilt containers, if possible. The standard for that is Docker. So in this post I will set up a Docker Swarm on the Personal Compute Cluster. Installing Docker Before we begin, if you set up SETI@Home as the end of the last post, we need to stop it first (skip if you didn’t set up SETI@Home): We need to ensure that swap is off, as various applications we will be working with later do not like it: Now we install all the needed software: The installation of the software should have created the user group docker. Now we need to add our user account to that group so we can run docker without Read More …

Building Out Nodes for Personal Cluster

September 1, 2019April 17, 2020 michael

Its time to build the cluster. I have all the parts described for my cluster in my prior post, and today’s goal is to get the computers set and fingered as described in my cluster network design. This is a rather long post as I cover both hardware and operating system set up. At the end of this process, we will have a cluster of computers that is ready to have application software installed. I suggest building all the nodes first before starting with the operating system installation. Before you get started with these steps, I highly recommend that you start the download for the Ubuntu 18.04 LTS Server installation ISO. Note that the EGLOBAL S200 computers couldn’t use the live install ISO for some reason, but everything worked just find with the alternative (or old style) Ubuntu installer. You can download the latest version of that installer here. Select the “64-bit PC (AMD64) server install image” version. Cluster Node Read More …

Project Kickoff – Personal Compute Cluster 2019 Edition

September 1, 2019January 5, 2020 michael

Three years ago, I worked through a project of creating a low cost, low power computer cluster, with the primary goal of becoming more familiar with the inner works of Apache Spark. This project did accomplish that goal, but since this cluster was made up of 32-bit ARM processors and each had only 2 GB of RAM, the cluster was not too useful for getting meaningful work done. What it did excel at, however, was showing the user how to write Spark code efficiently. If you wanted to get anything done on this constrained system you had to be mindful of every inefficiency. Jump ahead to 2019, and I have decided to give it another go. This time, I want to make a cluster that is moderately useful for data analysis and machine learning, but does not break the bank either. The first step is to document my system design requirements and general goals: The cluster’s primary purpose will be Read More …

DIY Big Data

One byte at a time …

Category: Personal Cluster