Running Spark with QFS on a Docker Swarm

In my last post, we put together a quick build that runs Apache Spark on the Personal Compute Cluster. The goal of that build was simply to demonstrate how to run Spark on a Docker Swarm. What it did not address well was how Spark would access data. I did set up a GlusterFS shared volume that allowed Spark to read and write data files. However, this was not fast, because the setup did not allow Spark to take advantage of data locality when scheduling tasks. As a result, almost all data reads and writes were done across the Ethernet backbone. The goal in this post is to set up a Spark build on Docker Swarm that is paired with a distributed file system configured so that data locality can be leveraged to speed up processing. There are several choices for distributed file systems, the most notable being Hadoop’s HDFS and the Quantcast File System (QFS).