Running Spark with QFS on a Docker Swarm

In my last post, we put together a quick build that runs Apache Spark on the Personal Compute Cluster. The goal of that build was simply to demonstrate how to run Spark on a Docker Swarm. What it did not address well was how Spark would access data. I did set up a GlusterFS shared volume that allowed Spark to read and write data files. However, this was not fast because the setup did not allow Spark to take advantage of data locality when scheduling tasks. As a result, almost all data reads and writes were done across the Ethernet backbone. The goal in this post is to set up a Spark build on Docker Swarm that is paired with a distributed file system configured so that data locality can be leveraged to speed up processing. There are several choices for distributed file systems, the most notable being Hadoop’s HDFS and the Quantcast File System (QFS). For Read More …

Running Spark on a Docker Swarm

My project to create a Personal Compute Cluster has finally come to a point where I can create the first usable deployment of Apache Spark. The goal here is simply to get a usable deployment of Spark, but not necessarily a robust one. What this means is that I will set up Spark to run without HDFS or QFS as the distributed file system to hold data. Instead, this setup will use the GlusterFS volume I created in my last post. GlusterFS is not ideal for holding data that Spark will analyze because, with the simple GlusterFS setup that I did, Spark cannot take advantage of data locality. Given my setup, Spark will see the GlusterFS volume as a local file system mount on each of the nodes. Because the GlusterFS volume presents the same files on each of the nodes, the GlusterFS volume behaves like a distributed file system. Spark just Read More …

Installing GlusterFS on the Personal Compute Cluster

When I set up the Personal Compute Cluster’s individual nodes, the SSD was set up to have three partitions. The partition mounted at /mnt/brick will be used to set up a GlusterFS volume. This GlusterFS volume will be used to persist data for services we run on the Docker Swarm that was previously set up on the cluster. The GlusterFS software needs to be installed on all of the nodes in the cluster. Here is where pssh makes it easy to set up all nodes at the same time. Now set up the GlusterFS volume from the master node and set the master node’s firewall to enable all traffic from within the cluster’s LAN: With the last command, you should see the list of nodes all connected. node1 will be listed as localhost since you are currently on it. The next step is to create a GlusterFS volume. First a folder on the /mnt/brick partition is created on every Read More …

Setting up a Docker Swarm on the Personal Compute Cluster

The last time I set up a Spark cluster, I installed Spark manually and configured each node directly. For the small scale cluster I had, that was fine. This time the cluster is still relatively small scale. However, I do want to take advantage of prebuilt containers, if possible. The standard for that is Docker. So in this post I will set up a Docker Swarm on the Personal Compute Cluster. Installing Docker Before we begin, if you set up SETI@Home at the end of the last post, we need to stop it first (skip if you didn’t set up SETI@Home): We need to ensure that swap is off, as various applications we will be working with later do not like it: Now we install all the needed software: The installation of the software should have created the user group docker. Now we need to add our user account to that group so we can run docker without Read More …

Building Out Nodes for Personal Cluster

It’s time to build the cluster. I have all the parts described for my cluster in my prior post, and today’s goal is to get the computers set up and configured as described in my cluster network design. This is a rather long post as I cover both hardware and operating system setup. At the end of this process, we will have a cluster of computers that is ready to have application software installed. I suggest building all the nodes first before starting with the operating system installation. Before you get started with these steps, I highly recommend that you start the download for the Ubuntu 18.04 LTS Server installation ISO. Note that the EGLOBAL S200 computers couldn’t use the live install ISO for some reason, but everything worked just fine with the alternative (or old style) Ubuntu installer. You can download the latest version of that installer here. Select the “64-bit PC (AMD64) server install image” version. Cluster Node Read More …

Project Kickoff – Personal Compute Cluster 2019 Edition

Three years ago, I worked through a project of creating a low cost, low power computer cluster, with the primary goal of becoming more familiar with the inner workings of Apache Spark. This project did accomplish that goal, but since this cluster was made up of 32-bit ARM processors and each had only 2 GB of RAM, the cluster was not too useful for getting meaningful work done. What it did excel at, however, was showing the user how to write Spark code efficiently. If you wanted to get anything done on this constrained system, you had to be mindful of every inefficiency. Jump ahead to 2019, and I have decided to give it another go. This time, I want to make a cluster that is moderately useful for data analysis and machine learning, but does not break the bank either. The first step is to document my system design requirements and general goals: The cluster’s primary purpose will be Read More …

Upgrading ODROID Cluster to Ubuntu 16.04

I decided to rebuild my ODROID XU4 cluster’s OS with a fresh build using Ubuntu 16.04. The previous build used Ubuntu 14.04. But seeing as 14.04 is several years old at this point, I wanted to upgrade my cluster to the most recent major build of Ubuntu Linux. For the most part, the process of building out the OS was pretty much the same, but there are some key differences on the master node. This post is just a wholesale copy of my previous two posts (here and here) covering the same topic, but updated for Ubuntu 16.04. Configuring the ODROID XU4 Operating System Temporary Networking Setup When setting up the nodes initially, you will need to SSH into them to configure their settings. However, if we go straight to our network design, we will not be able to connect to any node because the master node is not yet set up as a router. So we will need to Read More …

ARM7 CPUs, Double Alignment, and Apache Spark

I haven’t posted an update to my data analysis projects in a while. Partly because my day job has been a bit busy lately, and partly because what time I do have for my recreational coding has been taken up by a problem I was experiencing with Apache Spark. I started having stability problems on my ODROID XU4 cluster. I didn’t fully understand the cause at first, thinking for the longest time it was my own code. In the end, it proved to be a bug in Spark, or more specifically, an incompatibility between Spark’s memory management and the ARM7 platform of my ODROID XU4 cluster. The issue has to do with how some CPUs operate on double floating point values. These CPUs, including the 32-bit ARM7 CPU found in the ODROID XU4, require that when the CPU operates on a double floating point value the 8 bytes of memory used to contain the value should be aligned to 8 byte Read More …

Quantcast File System 1.2 for ARM7

NOTE – This article has been updated. It now assumes you have set up the cluster with Ubuntu 16.04, and it has the latest builds of QFS v1.2.1 and Spark v2.2.0. I have been using the Quantcast File System (QFS) as my primary distributed file system on my ODROID XU4 cluster. Due to QFS’s low memory footprint, it works well with Spark, allowing me to assign as much of the ODROID XU4’s limited 2 GB of RAM as possible to the Spark executor running on a node. Recently, QFS 1.2 was released. This version brings many features and updates, many not relevant to my ODROID cluster use case. However, the most notable updates relevant to the ODROID XU4 cluster include: a correction to Spark’s ability to create a Hive metastore on a new QFS instance (QFS-332); improved error reporting in the QFS/HDFS shim; and an HDFS shim for the Hadoop 2.7.2 API, which the latest versions of Spark use. In this post, I will update the ODROID XU4 Read More …

Using Custom Hive UDFs With PySpark

Using Python to develop on Apache Spark is easy and familiar for many developers. However, because Spark runs in a JVM, when your Python code interacts with the underlying Spark system, there can be an expensive process of data serialization and deserialization between the JVM and the Python interpreter. If you do most of your data manipulation using data frames in PySpark, you generally avoid this serialization cost because the Python code ends up being more of a high-level coordinator of the data frame operations rather than doing low-level operations on the data itself. This changes if you ever write a UDF in Python. To avoid the JVM-to-Python data serialization costs, you can use a Hive UDF written in Java. Creating a Hive UDF and then using it within PySpark can be a bit circuitous, but it does speed up your PySpark data frame flows if they are using Python UDFs. To illustrate this, I will rework the Read More …