My project to create a Personal Compute Cluster has finally come to a point where I can create the first usable deployment of Apache Spark. The goal here is simply to get a usable deployment of Spark, not necessarily a robust one. What this means is that I will set up Spark without HDFS or QFS as the distributed file system to hold data. Instead, this setup will use the GlusterFS volume I created in my last post. The reason GlusterFS is not ideal for holding data that Spark will analyze is that Spark cannot take advantage of data locality with the simple GlusterFS setup I created. Given my setup, Spark sees the GlusterFS volume as a local file system mount on each of the nodes. Because the GlusterFS volume presents the same files on each of the nodes, it behaves like a distributed file system; Spark just doesn't know which server holds the data for each file, and thus can't take advantage of data locality. For smaller data sets, this is perfectly acceptable. For larger data sets, it might become a performance issue. I will explore how to take advantage of data locality in a future post. Today's goal is simply to get a Spark cluster up and running on the Personal Compute Cluster and to take advantage of all 12 threads in each of the EGLOBAL S200 computers.
Building and Running Spark
The first thing to do is to clone my git repository containing software and tools for the Personal Compute Cluster:
cd
mkdir repositories
cd repositories
git clone git@github.com:DIYBigData/personal-compute-cluster.git
cd personal-compute-cluster/simple-spark-swarm/
At the time of this post's writing, I have developed one Docker stack for setting up a simple Spark cluster. This Spark cluster will use the GlusterFS mount set up in the last post as its data source, and it makes available a Jupyter notebook for interacting with the Spark engine. In the next section, I will review the Docker setup for creating a Spark cluster on the Docker Swarm that was previously set up on the Personal Compute Cluster. If you want to get straight to running Spark on the cluster, jump ahead to the Deploying Spark on the Swarm section.
Docker Swarm Stack Configuration for Apache Spark
Configuring a stack to run Apache Spark on a Docker Swarm is pretty straightforward. This simple Spark on Swarm setup can be found in the simple-spark-swarm directory of my repository, which contains a Docker compose file that defines a stack to deploy on the swarm. The stack's compose file defines three services that need to be created on the swarm:
- spark-master: This service is the Spark master for the cluster. There will be one instance of this service running in the swarm. It doesn't matter which node it runs on.
- spark-worker: This service is where the Spark executors run. It runs with four replicas (or however many physical nodes the cluster has) and is configured to create two Spark executors with five threads each, utilizing about 80% of the RAM each node has.
- spark-jupyter: This service launches the Jupyter notebook server and connects it to the Spark cluster running in the swarm.
The simple-spark-swarm directory in the repository contains a Docker compose file called deploy-spark-swarm.yml. This compose file defines and configures each of the services listed above. There are several things to note from this compose file (a sketch of its structure follows the list below):
- Each of the services' images is expected to be in the local Docker registry running on the swarm. There is a build-images.sh script in the simple-spark-swarm directory; if you look closely at it, you'll see how each of the images is built and pushed to the local registry.
- I set CPU limits on each of the services. This is reasonable for the spark-master and spark-jupyter services, but it may not be obvious why I put a CPU limit on the spark-worker even though I configure the Spark deployment on the worker to use all 12 threads the CPU supports. The reason is that, as of this post's writing, I haven't upgraded the cooling system in the EGLOBAL S200 nodes, and the factory-installed cooling doesn't adequately cool the CPU when all 12 threads are at 100%. Setting the CPU limit to 8 has the effect of throttling each thread to about 66% capacity, which I found prevents the CPU from overheating. When I upgrade the nodes' CPU coolers, I will revise this setting for the spark-worker.
- All three services mount a /data directory to the GlusterFS volume at /mnt/gfs/data. This is where the data that Spark will analyze should be placed. Note that the spark-jupyter service also mounts the GlusterFS volume at /mnt/gfs/jupyter-notebooks; this is where Jupyter will save any notebooks you create.
- A swarm network called spark-network is created for all the services. This allows service names to be used in configuration files in place of IP addresses. Given that, the only ports that need to be exposed outside the swarm are the ones you access directly, namely the Jupyter notebook and Spark status page ports.
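To make those notes concrete, here is a trimmed-down sketch of the structure of deploy-spark-swarm.yml. The registry address, commands, container paths, and exact limit values shown here are illustrative assumptions; consult the actual file in the repository for the values it uses.

version: '3.4'

services:
  spark-master:
    image: 127.0.0.1:5000/configured-spark-node    # local registry address is an assumption
    command: bin/spark-class org.apache.spark.deploy.master.Master
    networks:
      - spark-network
    volumes:
      - type: bind
        source: /mnt/gfs/data
        target: /data
    deploy:
      replicas: 1
      resources:
        limits:
          cpus: "2.0"                              # illustrative value

  spark-worker:
    image: 127.0.0.1:5000/configured-spark-node
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    networks:
      - spark-network
    volumes:
      - type: bind
        source: /mnt/gfs/data
        target: /data
    deploy:
      replicas: 4                                  # one worker per physical node
      resources:
        limits:
          cpus: "8.0"                              # throttles the 12 threads so the CPU does not overheat

  spark-jupyter:
    image: 127.0.0.1:5000/spark-jupyter-notebook
    ports:
      - "7777:7777"                                # Jupyter notebook UI
      - "4040:4040"                                # Spark application status page
    networks:
      - spark-network
    volumes:
      - type: bind
        source: /mnt/gfs/data
        target: /data
      - type: bind
        source: /mnt/gfs/jupyter-notebooks
        target: /home/jupyter/notebooks            # in-container path is an assumption
    deploy:
      replicas: 1

networks:
  spark-network:
    driver: overlay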
In the simple-spark-swarm directory there are also two sub-directories used to create the needed Docker images. The first Docker image is configured-spark-node, which is used for both the Spark master and Spark worker services, each with a different command. This image is based on the gettyimages/spark image, installs matplotlib and pandas, and adds the desired Spark configuration for the Personal Compute Cluster. The second Docker image is spark-jupyter-notebook, which builds on configured-spark-node by adding the Jupyter notebook server and configuring it. Both of these images are built by running the build-images.sh script.
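For reference, a minimal sketch of what a script like build-images.sh does, assuming the local registry is reachable at 127.0.0.1:5000 and the sub-directories are named after the images they build (the actual script in the repository defines the real registry address, directory names, and tags):

#!/bin/bash
# Sketch of the build-and-push flow; see build-images.sh for the real version.
set -e

REGISTRY=127.0.0.1:5000    # assumption: the swarm's local registry address

# build the Spark node image used by the spark-master and spark-worker services
docker build -t ${REGISTRY}/configured-spark-node ./configured-spark-node
docker push ${REGISTRY}/configured-spark-node

# build the Jupyter image, which layers the notebook server on top of configured-spark-node
docker build -t ${REGISTRY}/spark-jupyter-notebook ./spark-jupyter-notebook
docker push ${REGISTRY}/spark-jupyter-notebook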
Deploying Spark on Swarm
Assuming that you are deploying Spark to a Docker swarm configured similarly to my Personal Compute Cluster, deploying the Spark stack is pretty straightforward. If your cluster differs in any way, you may need to adjust some of the configurations in the various parts of the Spark stack I defined. The README file in the simple-spark-swarm directory explains what you might need to adjust, so I will not cover that here.
The first step in deploying the simple Spark stack is to create the directories on the GlusterFS volume that the stack expects:
mkdir -p /mnt/gfs/jupyter-notebooks
mkdir -p /mnt/gfs/data
Now, you simply launch the stack. Here I assume the current working directory is the simple-spark-swarm directory:
chmod u+x build-images.sh
./build-images.sh
docker stack deploy -c deploy-spark-swarm.yml spark
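If you want to confirm that all three services came up before opening the notebook, the standard Docker swarm commands will show the state of the stack (service names follow Docker's stack_service naming convention):

docker stack services spark              # lists each service and its replica count
docker service ps spark_spark-worker     # shows which node each worker replica landed on
docker service logs spark_spark-master   # views the Spark master's log output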
Then point your web browser at http://cluster-ip:7777 and you should get the Jupyter notebook top-level interface. Note that the spark-jupyter-notebook service is configured to launch Jupyter without authentication or a token. Since this is a private cluster, I figured the lack of security is OK. You can easily change this by editing the spark-jupyter-notebook launch command.
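For example, if you do want token authentication, the classic Jupyter notebook server accepts a token on its command line. Something along these lines could go in the image's launch command; the surrounding options shown here are assumptions about how the server is started, so adapt them to what you find in the repository:

jupyter notebook --no-browser --ip=0.0.0.0 --port=7777 --NotebookApp.token='pick-a-secret-token'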
Testing the Spark on Swarm Deployment
Next, we want to test things out. This isn't intended to be a full tutorial on Jupyter + Spark usage; there are plenty of resources for that. But I do want to run a quick job just to verify that things are working.
The first step is to create a new notebook in the Jupyter UI. We will use a simple program that finds and counts the prime numbers below a given maximum value. To create a test notebook, click on the "New" button in the upper right area and select "Python 3" from the options. You will get a new Jupyter notebook. Enter the following program, which implements the prime number search (based on this algorithm), into the notebook's first cell and press shift-return:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("CalculatePrimeNumbers")\
    .getOrCreate()

MAX_VALUE = 100000000

def isPrime(val):
    # 6k +/- 1 trial division primality test
    if val <= 3:
        return val > 1
    elif val % 2 == 0 or val % 3 == 0:
        return False
    else:
        i = 5
        while i*i <= val:
            if val % i == 0 or val % (i + 2) == 0:
                return False
            i += 6
        return True

values = spark.sparkContext.parallelize(
        range(1, MAX_VALUE + 1),
        2000
    ).map(
        lambda x: (x, isPrime(x))
    ).toDF().withColumnRenamed('_1', 'value').withColumnRenamed('_2', 'is_prime').cache()

values.filter(F.col('is_prime')).count()
The cluster should start crunching away, finding all the prime numbers up to MAX_VALUE and then counting them. On my four-node cluster, this task takes about 2.5 minutes to find all the primes up to 100 million. While the job is running, you can check out the Spark status page at http://cluster-ip:4040.
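Once the count comes back, you can run a quick follow-up cell against the cached DataFrame to sanity-check the results. This isn't part of the original test, just an optional extra:

# show the ten largest primes found below MAX_VALUE
values.filter(F.col('is_prime')) \
    .orderBy(F.col('value').desc()) \
    .limit(10) \
    .show()

# release the cached DataFrame when you are done experimenting
values.unpersist()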
That's it! At this point you have a working Spark cluster with a Jupyter notebook "frontend". In future posts I will explore what we can do with this simple cluster, and then look at how to take advantage of data locality using Spark in a Docker Swarm.