Running Spark with QFS on a Docker Swarm

In my last post, we put together a quick build that runs Apache Spark on the Personal Compute Cluster. The goal of that build was simply to demonstrate how to run Spark on a Docker Swarm. What it did not address well was how Spark would access data. I did set up a GlusterFS shared volume that allowed Spark to read and write data files. However, this was not fast because the setup did not allow Spark to take advantage of data locality when scheduling tasks. As a result, almost all data reads and writes were done across the Ethernet backbone. The goal in this post is to set up a Spark build on Docker Swarm that is paired with a distributed file system set up in a manner where data locality can be leveraged to speed up processing. There are several choices for distributed file systems, the most notable being Hadoop’s HDFS and the Quantcast File System (QFS). For Read More …

Running Spark on a Docker Swarm

My project to create a Personal Compute Cluster has finally come to a point where I can create the first usable deployment of Apache Spark. The goal here is simply to get a usable deployment of Spark, not necessarily a robust one. What this means is that I will set up Spark to run without HDFS or QFS as the distributed file system to hold data. Instead, this setup will use the GlusterFS volume I created in my last post. GlusterFS is not ideal for holding data that Spark will analyze because Spark cannot take advantage of data locality with the simple GlusterFS setup that I did. Given my setup, Spark will see the GlusterFS volume as a local file system mount on each of the nodes. Because the GlusterFS volume presents the same files on each of the nodes, the GlusterFS volume behaves like a distributed file system. Spark just Read More …

Installing GlusterFS on the Personal Compute Cluster

When I set up the Personal Compute Cluster’s individual nodes, the SSD was set up to have three partitions. The partition mounted at /mnt/brick will be used to set up a GlusterFS volume. This GlusterFS volume will be used to persist data for services we run on the Docker Swarm that was previously set up on the cluster. The GlusterFS software needs to be installed on all of the nodes in the cluster. Here is where pssh makes it easy to set up all nodes at the same time. Now set up the GlusterFS volume from the master node and set the master node’s firewall to allow all traffic from within the cluster’s LAN: With the last command, you should see the list of nodes all connected. node1 will be listed as localhost since you are currently on it. The next step is to create a GlusterFS volume. First, a folder on the /mnt/brick partition is created on every Read More …

Setting up a Docker Swarm on the Personal Compute Cluster

The last time I set up a Spark cluster, I installed Spark manually and configured each node directly. For the small-scale cluster I had, that was fine. This time the cluster is still relatively small scale. However, I do want to take advantage of prebuilt containers, if possible. The standard for that is Docker. So in this post I will set up a Docker Swarm on the Personal Compute Cluster. Installing Docker Before we begin, if you set up SETI@Home at the end of the last post, we need to stop it first (skip if you didn’t set up SETI@Home): We need to ensure that swap is off, as various applications we will be working with later do not like it: Now we install all the needed software: The installation of the software should have created the user group docker. Now we need to add our user account to that group so we can run docker without Read More …

Building Out Nodes for Personal Cluster

It’s time to build the cluster. I have all the parts described for my cluster in my prior post, and today’s goal is to get the computers set up and configured as described in my cluster network design. This is a rather long post as I cover both hardware and operating system setup. At the end of this process, we will have a cluster of computers that is ready to have application software installed. I suggest building all the nodes first before starting with the operating system installation. Before you get started with these steps, I highly recommend that you start the download for the Ubuntu 18.04 LTS Server installation ISO. Note that the EGLOBAL S200 computers couldn’t use the live install ISO for some reason, but everything worked just fine with the alternative (or old style) Ubuntu installer. You can download the latest version of that installer here. Select the “64-bit PC (AMD64) server install image” version. Cluster Node Read More …

Project Kickoff – Personal Compute Cluster 2019 Edition

Three years ago, I worked through a project of creating a low-cost, low-power computer cluster, with the primary goal of becoming more familiar with the inner workings of Apache Spark. This project did accomplish that goal, but since this cluster was made up of 32-bit ARM processors and each had only 2 GB of RAM, the cluster was not too useful for getting meaningful work done. What it did excel at, however, was showing the user how to write Spark code efficiently. If you wanted to get anything done on this constrained system, you had to be mindful of every inefficiency. Jump ahead to 2019, and I have decided to give it another go. This time, I want to make a cluster that is moderately useful for data analysis and machine learning, but does not break the bank either. The first step is to document my system design requirements and general goals: The cluster’s primary purpose will be Read More …