This project walks you through all the steps necessary to create a distributed compute cluster for personal productivity and for exploring big data and machine learning technologies such as Hadoop, Spark, and Elasticsearch. The cluster leverages the very economical EGLOBAL S200 mini computer and provides a total of 48 threads and 256 GB of RAM across four nodes, making it well suited to running moderately sized Apache Spark jobs against terabyte-scale data sets with reasonable productivity.
Building the Cluster
Below are the steps to obtain, set up, and configure a Personal Compute Cluster.
- Design and Hardware Selection
- Hardware Construction
- Operating System Installation and Configuration
- Setting up a Docker Swarm (sketched below)
- Creating a GlusterFS Volume (sketched below)
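
As a rough illustration of the swarm step, the sketch below drives the setup through the Docker SDK for Python rather than the `docker` CLI the guide itself uses. The manager address is a placeholder, not a value from the guide, and each function is meant to run on the node named in its docstring.

```python
"""A minimal sketch of the swarm-creation step using the Docker SDK for
Python (docker-py). The address below is a placeholder for your own
manager node, not a value from this guide."""
import docker

MANAGER_ADDR = "192.168.1.10"  # hypothetical LAN address of the manager node


def init_manager() -> str:
    """Run on the manager node: create the swarm and return the worker join token."""
    client = docker.from_env()
    client.swarm.init(advertise_addr=MANAGER_ADDR)
    return client.swarm.attrs["JoinTokens"]["Worker"]


def join_worker(worker_token: str) -> None:
    """Run on each worker node, using the token returned by init_manager()."""
    client = docker.from_env()
    client.swarm.join(
        remote_addrs=[f"{MANAGER_ADDR}:2377"],
        join_token=worker_token,
    )


def list_nodes() -> None:
    """Run on the manager node: confirm all four nodes are part of the swarm."""
    client = docker.from_env()
    for node in client.nodes.list():
        hostname = node.attrs["Description"]["Hostname"]
        state = node.attrs["Status"]["State"]
        print(f"{hostname}: {state}")
```

The join step runs once on each of the three remaining nodes; `list_nodes()` on the manager should then report all four nodes as ready.

Similarly, a hedged sketch of the GlusterFS step: it simply wraps the standard `gluster` CLI commands (peer probe, volume create, volume start) in Python. The hostnames, brick path, and volume name are placeholders, and the commands must run as root on one of the cluster nodes.

```python
"""A minimal sketch of creating a replicated GlusterFS volume by driving the
gluster CLI with subprocess. All names and paths here are hypothetical."""
import subprocess

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical hostnames
BRICK_PATH = "/data/gluster/brick1"           # hypothetical brick directory
VOLUME = "swarm-vol"                          # hypothetical volume name


def run(cmd: list[str]) -> None:
    """Echo a command, run it, and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Form the trusted storage pool by probing the other nodes from the first one.
for peer in NODES[1:]:
    run(["gluster", "peer", "probe", peer])

# Create a replicated volume with one brick per node, then start it.
bricks = [f"{node}:{BRICK_PATH}" for node in NODES]
run(["gluster", "volume", "create", VOLUME, "replica", str(len(NODES))] + bricks)
run(["gluster", "volume", "start", VOLUME])
```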
Installing Data Analysis Software
- Deploying Spark with GlusterFS as a Stack on a Docker Swarm
- Deploying Spark with the Quantcast File System (QFS) on a Docker Swarm (a usage sketch follows this list)
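
Once either stack is up, any node can submit work to it. The sketch below is a hypothetical smoke test rather than part of the guide: the Spark master URL, the mounted data path, and the executor sizing are assumptions based on the four-node, 48-thread, 256 GB configuration described above.

```python
"""A minimal PySpark sketch for exercising the deployed cluster. The master
URL, data path, and resource settings are placeholders, not values taken
from this guide."""
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://swarm-manager:7077")    # hypothetical Spark master URL
    .appName("cluster-smoke-test")
    .config("spark.executor.memory", "48g")  # leaves headroom on a 64 GB node
    .config("spark.executor.cores", "12")    # one executor per 12-thread node
    .getOrCreate()
)

# Read a Parquet data set from the shared volume (assumed mounted at /data on
# every node) and run a simple aggregation to confirm the work is distributed.
df = spark.read.parquet("/data/events.parquet")
df.groupBy("event_type").count().show()

spark.stop()
```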
Cluster Performance
More to come …