Using the Quantcast File System with Spark on the ODROID XU4 Cluster

I used to work at a company named Quantcast, which is known for processing petabyte-scale data sets. When operating at that scale, any inefficiency in your data infrastructure gets multiplied many times over. Early on, Quantcast found that the inefficiencies in Hadoop literally cost them by requiring more hardware to obtain certain levels of performance. So Quantcast set out to write a highly efficient big data stack. At the bottom of that stack was Quantcast’s distributed file system called QFS. Quantcast open-sourced QFS, and performed several studies to show the efficiency gains QFS had over HDFS. Given the first hand experience I have had with QFS performance, I thought it would interesting to get it running on the ODROID XU4 cluster in combination with Spark to see how much QFS would improve processing speeds in our set up. Spoiler alert: there was no time improvements, but QFS required less memory than HDFS, allowing me to assign more RAM to Spark.

Installing QFS

While QFS can be a more powerful file system, it is also a bit more raw in terms of supporting tools and installation ease as compared to our previous experience with installing HDFS. In such, there are more details to this installation to get the file system running. I did create a tool that will enable you to launch QFS on the cluster with the same ease of launching the HDFS file system, but more on that later.

Note that these instructions could be applied to the cluster as it exists after the last blog post, or on a new cluster that has just been set up with no installed software. Ultimately you should choose if you are going to persist with HDFS or QFS as your file system on the cluster.

The first thing to do is to install all of QFS itself. While Quantcast does distribute some pre-built binaries, none of them are built for ARM system. Compiling QFS is rather straightforward, but time consuming. I have made a build for ARMv71 of QFS’s last stable release.

QFS 1.2.1 for ARM v7.1

To directly download my distributions of QFS from my server, do the following as hduser on the master node as hduser:

cd /opt
sudo wget http://diybigdata.net/downloads/qfs/qfs-ubuntu-16.04.3-1.2.1-armv7l.tgz
sudo tar xvzf qfs-ubuntu-16.04.3-1.2.1-armv7l.tgz
sudo chown -R hduser:hadoop qfs-ubuntu-16.04.3-1.2.1-armv7l
sudo ln -s /opt/qfs-ubuntu-16.04.3-1.2.1-armv7l /usr/local/qfs

Now we have to make some directories that do not come with the standard distribution of QFS, then sync the software out to the slaves. We sync the software before creating the configuration files since we have to have different configuration files between the master and slave nodes.

mkdir /usr/local/qfs/conf
mkdir /usr/local/qfs/sbin
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "mkdir -p /data/qfs/logs"
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "mkdir -p /data/qfs/checkpoint"
parallel-ssh -i -h ~odroid/cluster/slaves.txt -l hduser "mkdir /data/qfs/chunk"
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "chown -R hduser:hadoop /data/qfs"
rsync -avz /opt/qfs-ubuntu-16.04.3-1.2.1-armv7l/ root@slave1:/opt/qfs-ubuntu-16.04.3-1.2.1-armv7l
rsync -avz /opt/qfs-ubuntu-16.04.3-1.2.1-armv7l/ root@slave2:/opt/qfs-ubuntu-16.04.3-1.2.1-armv7l
rsync -avz /opt/qfs-ubuntu-16.04.3-1.2.1-armv7l/ root@slave3:/opt/qfs-ubuntu-16.04.3-1.2.1-armv7l
parallel-ssh -i -h ~odroid/cluster/slaves.txt -l root "chown -R hduser:hadoop /opt/qfs-ubuntu-16.04.3-1.2.1-armv7l"
parallel-ssh -i -h ~odroid/cluster/slaves.txt -l root "ln -s /opt/qfs-ubuntu-16.04.3-1.2.1-armv7l /usr/local/qfs"
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get update"
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get install libboost-regex-dev -y"

Now install configuration and tools from the odroid-xu4-cluster Github repository. Note that the configuration found in this repository assumes that the XU4 cluster is configured as previously described, including checking out the git repository.

cd ~/odroid-xu4-cluster/
git pull
cp qfs/sbin/* /usr/local/qfs/sbin/
cp qfs/configuration/* /usr/local/qfs/conf/
parallel-scp -h ~odroid/cluster/slaves.txt -l hduser qfs/configuration/Chunkserver.prp /usr/local/qfs/conf/

Let’s review the various configuration files. The set of configurations that went on the master node controls the set up of Meta Server, which is the equivalent to a Name Node in HDFS. First there is Metaserver.prp configuration file:

metaServer.clientPort = 20000
metaServer.chunkServerPort = 30000
metaServer.logDir = /data/qfs/logs
metaServer.cpDir = /data/qfs/checkpoint
metaServer.recoveryInterval = 30
metaServer.clusterKey = qfs-odroid-xu4
metaServer.rackPrefixes = 10.10.10.2 1   10.10.10.3 2   10.10.10.4 3
metaServer.msgLogWriter.logLevel = INFO
chunkServer.msgLogWriter.logLevel = NOTICE
metaServer.rootDirMode = 0777
metaServer.rootDirGroup = 1001
metaServer.rootDirUser = 1001

The last two lines are of particular interest as they set the UID and GID of the user that writes to file system. QFS has the option of enforcing permissions through security schemes like Kerberos. However we are not going to use that level of authentication to the file system, but QFS still needs to know a user that files are to be written under. Here I configured things to use the hduser and hadoop group on the master server. You should check whether your installation has the same UID and GID for those users as mine did. To check:

id -u hduser
id -g hduser

If your UID and GID are different, then update the configuration file accordingly. Another configuration file on master node is the wedUI.cfg file which controls the monitoring UI for QFS:

[webserver]
webServer.metaserverHost = 10.10.10.1
webServer.metaserverPort = 20000
webServer.port = 20050
webServer.docRoot = /usr/local/qfs/webui/files/
webServer.host = 0.0.0.0
webserver.allmachinesfn = /dev/null

Finally, there is the qfs-env.sh file which is used to set key environment variables to be used when launching QFS using the start-qfs.sh script.

#!/bin/bash

#
# QFS specific environment variables
#

export QFS_LOGS_DIR=/data/qfs/logs

On each of the slave nodes we installed the configuration file for the Chunk Servers, Chunkserver.prp, which hold the blocks of data that gets saved to QFS. It’s configuration file contains:

chunkServer.metaServer.hostname = 10.10.10.1
chunkServer.metaServer.port     = 30000
chunkServer.clientPort = 20000
chunkServer.chunkDir = /data/qfs/chunk
chunkServer.clusterKey = qfs-odroid-xu4
chunkServer.stdout = /dev/null
chunkServer.stderr = /dev/null
chunkServer.ioBufferPool.partitionBufferCount = 65536
chunkServer.msgLogWriter.logLevel = INFO
chunkServer.diskQueue.threadCount = 2

You will notice that both the Metaserver.prp and Chunkserver.prp files contains a variable named chunkServer.clusterKey. This variable’s value is simply a string that internally identifies the instance of QFS the configuration file pertains to. On very large clusters, it is not uncommon to run multiple instances of a distributed file system. The typical pattern is to have a read-only instance that contains the “production data”, and then a read-write instance where jobs can do work. QFS was designed to permit many different instances of file systems to run on the same cluster and enables it through having each process know which file system instance it belongs to, as identified by the chunkServer.clusterKey value. Though we are running only one instance of QFS, we still need to set the cluster key so that the chunk servers will connect properly to the meta server.

We now have QFS fully installed. We need to initialize the Meta Server (note that we only ever do this once):

/usr/local/qfs/bin/metaserver -c /usr/local/qfs/conf/Metaserver.prp

To launch QFS, use the launch script installed from the git repository:

/usr/local/qfs/sbin/start-qfs.sh

Putting Data onto QFS

Let’s put the concatenated blog dat that we constructed previously onto our QFS instance. Note that these are fairly verbose commands.

/usr/local/qfs/bin/tools/qfsshell -s master -p 20000 -q -- mkdir /user/michael/data
/usr/local/qfs/bin/tools/cptoqfs -s master -p 20000 -d ~/blogs_all.xml -k /user/michael/data/blogs_all.xml -r 2

More documentation on other QFS commands available and how to configure things to be less verbose can be found at the QFS wiki.

Installing Spark

If you already installed Spark in accordance to my last post on installing Spark with Hadoop, do not skip this section as you will need to install a different version of Spark again. The reason for this is that QFS provides bindings that allows clients to interact with it as if it were HDFS. These instructions are written assuming that you do not have Spark already installed to work with Hadoop. If you wish to keep both use cases for Spark available, you will need to install spark into a different directory and have a different symlink in /usr/local. I leave it as an exercise to the reader to update these instructions as appropriate to accomplish that.

The first step is to download and install a version of Spark that was compiled to work with the ARM71 processor:

cd /opt
sudo wget http://diybigdata.net/downloads/spark/spark-2.2.0-bin-hadoop2.7-arm71-double-alignment.tgz
sudo tar xvzf spark-2.2.0-bin-hadoop2.7-arm71-double-alignment.tgz 
sudo chown -R hduser:hadoop spark-2.2.0-bin-armhf
sudo ln -s /opt/spark-2.2.0-bin-armhf /usr/local/spark

Now we need to configure Spark to run in standalone cluster mode:

cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vi spark-env.sh

Add the following lines somewhere in the spark-env.sh file:

LD_LIBRARY_PATH=/usr/local/qfs/lib
SPARK_DIST_CLASSPATH=/usr/local/qfs/lib/hadoop-2.7.2-qfs-1.2.1.jar:/usr/local/qfs/lib/qfs-access-1.2.1
SPARK_HOME=/usr/local/spark/

PYSPARK_PYTHON=python3
PYTHONHASHSEED=12121969
HADOOP_CONF_DIR=/usr/local/spark/conf/
SPARK_LOCAL_DIRS=/data/spark
SPARK_WORKER_MEMORY=1300M
SPARK_WORKER_CORES=2

Next we need to set up the defaults configuration values that Spark uses.

cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf

Add the following lines to the file:

spark.master                    spark://master:7077
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.executor.memory           1300M
spark.driver.memory             1000M
spark.io.compression.codec      lz4
spark.driver.cores              2
spark.executor.cores            1

# These settings help prevent Spark from thinking the ODROID XU4 node
# is failing when a thread gets slowed down etither by the governor or
# getting on a slower core.
#
# spark.akka.timeout is only needed pre-Spark 2.0
#spark.akka.timeout             200

spark.network.timeout           800
spark.worker.timeout            800

# This setting is to tell the class loaders in Spark that they
# only need to load the QFS access libraries once 

spark.sql.hive.metastore.sharedPrefixes         com.quantcast.qfs

We need to tell spark where the slave nodes are.

cp slaves.template slaves
vi slaves

Add the following lines to the slaves file:

slave1
slave2
slave3

You will note that in the spark-defaults.conf file we set the HADOOP_CONF_DIR to this Spark installation’s configuration directory. The reason for that is we cannot use our original Hadoop configuration here as it points to HDFS. Instead, we need to tell Spark to use QFS. To do that, create a core-site.xml file in this Spark installation’s configuration directory:

vi core-site.xml

And set the contents to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Setting for QFS-->

<configuration>
  <property>
    <name>fs.qfs.impl</name>
    <value>com.quantcast.qfs.hadoop.QuantcastFileSystem</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>qfs://master:20000</value>
  </property>
  <property>
    <name>fs.qfs.metaServerHost</name>
    <value>master</value>
  </property>
  <property>
    <name>fs.qfs.metaServerPort</name>
    <value>20000</value>
  </property>
</configuration>

The final step is to copy the Spark software and configuration to each of the slaves. As the odroid user on the master node:

rsync -avxz /opt/spark-2.2.0-bin-armhf/ root@slave1:/opt/spark-2.2.0-bin-armhf
rsync -avxz /opt/spark-2.2.0-bin-armhf/ root@slave2:/opt/spark-2.2.0-bin-armhf
rsync -avxz /opt/spark-2.2.0-bin-armhf/ root@slave3:/opt/spark-2.2.0-bin-armhf
parallel-ssh -i -h ~odroid/cluster/slaves.txt -l root "chown -R hduser:hadoop /opt/spark-2.2.0-bin-armhf"
parallel-ssh -i -h ~odroid/cluster/slaves.txt -l root "ln -s /opt/spark-2.2.0-bin-armhf /usr/local/spark"

We now have an installation of Spark set up to work with QFS.

Installing Jupyter

You may skip this section of you previously installed Jupyter base on the previous post.

Installing Juypyter Notebooks is relatively simple. As hduser on the master node:

sudo mkdir /data/jupyter
sudo chown -R hduser:hadoop /data/jupyter/
mkdir ~/notebooks
sudo apt-get install python3-pip
sudo pip3 install jupyter

Running Word Count

Before running the word count program on Spark with the QFS data store, we need to launch spark than Jupyter (assuming that QFS and Spark are already running). Note that you should replace the IP address in the --ip parameter with the external IP address of your cluster.

/usr/local/spark/sbin/start-all.sh 
XDG_RUNTIME_DIR="/data/jupyter" PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip 192.168.1.50 --port=7777 --notebook-dir=/home/hduser/notebooks" /usr/local/spark/bin/pyspark --master spark://master:7077

Note the use of the Spark installation set up for QFS. Once it is running, point your computer’s browser to http://<cluster IP address>:7777/. You will see the Jupyter interface. Create a new notebook and enter the word count program using the QFS data source into the first cell and press shift-enter:

from operator import add
blog_data = sc.textFile('qfs://master:20000/user/michael/data/blogs_all.xml').cache()
counts = blog_data.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
counts.take(100)

On my system, the cluster took 2.3 minutes to complete word count versus 2.8 minutes with HDFS as we previously set it up. While this is certainly a speed up, the real reason QFS was faster in this case is because the original run with HDFS was using a block size of 16 MB in HDFS. This means there were more part files for HDFS to read and Spark to process, which created about a half of minute of additional overhead. I changed the HDFS block size to 64 MB, which is what QFS is hard coded to use as a block size (you can only change QFS’s block size if you change the code and recompile), and the Spark on HDFS time for this problem sped up to 2.3 minutes. As I have previously pointed out, reducing the number of blocks processed has significant impact in map/reduce technologies like Spark and Hadoop.

It is also worth noting that the small ODROID XU4 cluster doesn’t really leverage QFS’s full potential. The is because QFS is designed to address the inefficiencies that happen at very large scale use. For example, if we had at least chunk server 9 nodes, QFS could use Reed-Solomon encoding of data which makes more efficient the data replication and recovery process.

Shutting Things Down

To shut down Jupyter, Spark, and QFS, first close and halt any open notebooks in Jupyter, then press control-C twice in the terminal that launched Jupyter. Now run the shutdown scripts for Spark and QFS:

/usr/local/spark/sbin/stop-all.sh
/usr/local/qfs/sbin/stop-qfs.sh

DIY Big Data

One byte at a time …