Running Spark with QFS on a Docker Swarm
In my last post, we put together a quick build that runs Apache Spark on the Personal Compute Cluster. The goal of that build was simply to demonstrate how to run Spark on a Docker Swarm. What it did not address well was how Spark would access data. I did set up a GlusterFS shared volume that allowed Spark to read and write data files, but this was not fast because the setup did not allow Spark to take advantage of data locality when scheduling tasks. As a result, almost all data reads and writes were done across the Ethernet backbone.

The goal in this post is to set up a Spark build on Docker Swarm paired with a distributed file system arranged so that data locality can be leveraged to speed up processing. There are several choices for distributed file systems, the most notable being Hadoop's HDFS and the Quantcast File System (QFS).
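To make the end goal concrete, here is a minimal sketch (not part of the build described so far) of what pointing Spark at QFS can look like from a job's perspective, assuming the QFS Hadoop-compatible client jar is on Spark's classpath. The service name `qfs-master`, the port, and the data path are placeholders you would replace with your own deployment's values.

```python
from pyspark.sql import SparkSession

# Placeholder address of the QFS metaserver on the swarm's overlay network.
QFS_METASERVER_HOST = "qfs-master"
QFS_METASERVER_PORT = "20000"

spark = (
    SparkSession.builder
    .appName("qfs-locality-sketch")
    # QFS ships a Hadoop-compatible client; these properties register it
    # so that Spark can resolve qfs:// URIs.
    .config("spark.hadoop.fs.qfs.impl",
            "com.quantcast.qfs.hadoop.QuantcastFileSystem")
    .config("spark.hadoop.fs.qfs.metaServerHost", QFS_METASERVER_HOST)
    .config("spark.hadoop.fs.qfs.metaServerPort", QFS_METASERVER_PORT)
    .getOrCreate()
)

# When the QFS chunk servers run on the same nodes as the Spark workers,
# the scheduler can place tasks on the node that already holds the data,
# instead of pulling every block across the network.
df = spark.read.csv("qfs:///data/example.csv", header=True)
df.show(5)
```

The point of the pairing is the last comment: with chunk servers and Spark workers co-located on each node, reads can be served locally rather than over the Ethernet backbone, which is exactly what the GlusterFS-based setup could not do.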