Setting up a Docker Swarm on the Personal Compute Cluster

The last time I set up a Spark cluster, I installed Spark manually and configured each node directly. For the small-scale cluster I had, that was fine. This time the cluster is still relatively small scale; however, I do want to take advantage of prebuilt containers where possible. The standard for that is Docker, so in this post I will set up a Docker Swarm on the Personal Compute Cluster. Installing Docker: Before we begin, if you set up SETI@Home at the end of the last post, we need to stop it first (skip this if you didn't set up SETI@Home). We also need to ensure that swap is off, as various applications we will be working with later do not like it. Then we install all the needed software. The installation of the software should have created the user group docker. Now we need to add our user account to that group so we can run docker without Read More …
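The excerpt elides the actual commands, so here is a minimal sketch of the steps it describes, assuming Ubuntu's docker.io package and that SETI@Home runs under the stock boinc-client service (the original post may use different package or service names):

    # Stop the BOINC client that runs SETI@Home (service name is an assumption)
    sudo systemctl stop boinc-client

    # Turn swap off now, and comment out swap entries so it stays off after reboot
    sudo swapoff -a
    sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab

    # Install Docker and let our account run it without sudo
    sudo apt-get update && sudo apt-get install -y docker.io
    sudo usermod -aG docker $USER   # takes effect at next login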

Building Out Nodes for Personal Cluster

It's time to build the cluster. I described all the parts for my cluster in my prior post, and today's goal is to get the computers assembled and configured as described in my cluster network design. This is a rather long post, as I cover both the hardware and operating system setup. At the end of this process, we will have a cluster of computers that is ready to have application software installed. I suggest building all the nodes first before starting the operating system installation. Before you get started with these steps, I highly recommend that you start the download for the Ubuntu 18.04 LTS Server installation ISO. Note that the EGLOBAL S200 computers couldn't use the live install ISO for some reason, but everything worked just fine with the alternative (or old-style) Ubuntu installer. You can download the latest version of that installer here. Select the "64-bit PC (AMD64) server install image" version. Cluster Node Read More …
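As a side note, if you need to get that installer ISO onto a USB stick, the usual approach looks like the sketch below; the image filename and the /dev/sdX device path are placeholders, and writing to the wrong device will destroy its contents:

    # Identify the USB device first (lsblk on Linux; diskutil list on macOS)
    lsblk

    # Write the installer image to the USB device (replace /dev/sdX!)
    sudo dd if=ubuntu-18.04-server-amd64.iso of=/dev/sdX bs=4M status=progress
    sync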

Project Kickoff – Personal Compute Cluster 2019 Edition

Three years ago, I worked through a project of creating a low-cost, low-power computer cluster, with the primary goal of becoming more familiar with the inner workings of Apache Spark. This project accomplished that goal, but since the cluster was made up of 32-bit ARM processors, each with only 2 GB of RAM, it was not too useful for getting meaningful work done. What it did excel at, however, was teaching the user to write Spark code efficiently: if you wanted to get anything done on this constrained system, you had to be mindful of every inefficiency. Jump ahead to 2019, and I have decided to give it another go. This time, I want to make a cluster that is moderately useful for data analysis and machine learning, but that does not break the bank either. The first step is to document my system design requirements and general goals: The cluster's primary purpose will be Read More …

Quantcast File System 1.2 for ARMv7

NOTE – This article has been updated. It now assumes you have set up the cluster with Ubuntu 16.04, and it has the latest builds of QFS v1.2.1 and Spark v2.2.0. I have been using the Quantcast File System (QFS) as my primary distributed file system on my ODROID XU4 cluster. Due to QFS's low memory footprint, it works well with Spark, allowing me to assign as much as possible of the ODROID XU4's limited 2 GB of RAM to the Spark executor running on a node. Recently, QFS 1.2 was released. This version brings many features and updates, though many are not relevant to my ODROID cluster use case. The most notable updates relevant to the ODROID XU4 cluster include: corrected Spark's ability to create a Hive metastore on a new QFS instance (QFS-332); improved error reporting in the QFS/HDFS shim; and an HDFS shim for the Hadoop 2.7.2 API, which the latest versions of Spark use. In this post, I will update the ODROID XU4 Read More …
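To give a sense of how Spark reaches QFS through that shim, a spark-shell invocation would look roughly like the sketch below. The jar paths, hostname, and port are placeholders, and the fs.qfs.* property names reflect my understanding of the QFS Hadoop plugin rather than anything confirmed in this excerpt:

    # Point Spark's Hadoop layer at the QFS metaserver via the QFS/HDFS shim
    spark-shell \
      --jars /opt/qfs/lib/hadoop-qfs.jar,/opt/qfs/lib/qfs-access.jar \
      --conf spark.hadoop.fs.qfs.impl=com.quantcast.qfs.hadoop.QuantcastFileSystem \
      --conf spark.hadoop.fs.qfs.metaServerHost=master \
      --conf spark.hadoop.fs.qfs.metaServerPort=20000

    # Then inside the shell, QFS paths work like any Hadoop filesystem:
    #   spark.read.textFile("qfs://master:20000/user/data.txt").count()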

Adding a New Node to the ODROID XU4 Cluster

I recently acquired another ODROID XU4 device (and a MicroSD card for bulk storage) to add to my XU4 cluster. This new node brings my node count to five. Adding a new node to the cluster is relatively straightforward, but there are a lot of details. In the modern datacenter, this operation would be accomplished with provisioning tooling that builds the new node from an image. However, I haven't set up such tooling on my cluster, so I will need to set up the new node manually. In the original cluster configuration, I used a 5-port ethernet switch for the internal network. Given that there was an open port, no additional hardware beyond the new node is needed. However, if I were to add a sixth node (or beyond), I would need to upgrade my ethernet switch to something such as an 8-port switch. I will also note that using the 40 mm PCB spacers I originally ordered makes Read More …

Connecting to HDFS from a Computer External to the Cluster

Since I have set up my ODROID XU4 cluster to work with Spark from a Jupyter web notebook, one of the little annoyances I have had is how inefficient it was to transfer data into the HDFS file system on the cluster. It involved downloading data to my Mac laptop via a web download, transferring that data to the master node, then running hdfs dfs -put to place the data onto the file system. While I did set up my cluster to create an NFS mount, it can be very slow when transferring large files to HDFS. So, my solution was to install HDFS on my Mac laptop and configure it to use the ODROID XU4 cluster. One of the requirements for an external client (e.g., your Mac laptop) to interact with HDFS on a cluster is that the external client must be able to connect to every node in the cluster. Since the master node of the cluster was Read More …
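The client-side setup generally amounts to unpacking a Hadoop distribution on the laptop and pointing it at the cluster's namenode. A sketch, with version numbers, hostnames, and paths that are illustrative rather than taken from the original post:

    # On the Mac: unpack Hadoop and put its tools on the PATH
    tar xzf hadoop-2.7.2.tar.gz -C ~/opt
    export HADOOP_HOME=~/opt/hadoop-2.7.2
    export PATH=$HADOOP_HOME/bin:$PATH

    # The client's core-site.xml must name the cluster's namenode, e.g.:
    #   <name>fs.defaultFS</name>  <value>hdfs://master:9000</value>

    # Now data can go straight from the laptop into HDFS
    hdfs dfs -put big-download.csv /user/me/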

Mounting HDFS with NFS

After setting up the Hadoop installation on the ODROID XU4 cluster, we need to find a way to get data in and out of it. The traditional pattern when a cluster is on its own network, as ours is, is to have an edge node that the user logs into: the user transfers the data to that edge node, then puts that data into HDFS from there. Speaking from experience, this is annoyingly too much work. For my personal cluster, I want the HDFS file system to integrate with my Mac laptop. The most robust way to accomplish my goal with HDFS is to have it mounted as an NFS drive. The Hadoop distribution we are using has an NFS server built in. This server runs on the master node, effectively acting as a proxy between the HDFS cluster and the external network. The pro of this approach is that I get the usage paradigm that I want. Read More …
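In rough strokes, using Hadoop's built-in NFS gateway looks like the sketch below, following the stock Hadoop 2.x daemon names; the hostname and mount point are placeholders, and macOS's mount_nfs takes slightly different options than the Linux line shown:

    # On the master node: start Hadoop's portmap and NFS3 gateway daemons
    sudo hadoop-daemon.sh start portmap
    hadoop-daemon.sh start nfs3

    # On the client: mount the whole HDFS namespace as an NFS volume
    mkdir -p ~/hdfs
    sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync master:/ ~/hdfs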

Installing Hadoop onto an ODROID XU4 Cluster

Now is the time when we start to see the fruits of our labor in getting the ODROID XU4 low-cost cluster built. We will be installing Hadoop and configuring it to serve an NFS mount that can be attached to your client computer (e.g., your laptop), letting you interact with the HDFS file system as if it were another hard drive on your computer. This feature will greatly ease the use of our cluster, as it will minimize the need for a user to log into the cluster to use it. An NFS mount is not the only facet of the cluster needed to enable this client usage vision, but it is an important one. Before we install Hadoop, let's discuss what we are trying to accomplish by installing it. Hadoop has three components: the Hadoop Distributed File System (HDFS), YARN, and MapReduce. For our purposes, we are most interested in HDFS, but we will play around with the Read More …
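For orientation, the HDFS-specific part of a stock Hadoop 2.x bring-up reduces to a handful of commands; this is a sketch of the standard tooling, not necessarily the post's exact steps:

    # On the master node, one time only: format the namenode's metadata store
    hdfs namenode -format

    # Start HDFS across the cluster (reads the slaves file and uses SSH)
    start-dfs.sh

    # Verify that every datanode has reported in
    hdfs dfsadmin -report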

Adding the MicroSD Data Drives to the ODROID XU4 Cluster

We are nearly done with the basic setup of the ODROID XU4 four-node cluster. The final step is to configure the MicroSD cards that were in the bill of materials as data drives for each node. Before we start, ensure that the cluster is completely powered down and remove the power cords from each node. We will be removing and adding components to the ODROID XU4s, and it is best practice to have no power to a device when doing so. The first step is to format each MicroSD card with the ext4 filesystem. Each platform has its own instructions for how to do this; I will provide the Mac OS X instructions since that is what I use. In the terminal, do the following: if you don't have it already, install the Homebrew package management app for OS X (visit the brew.sh website for instructions), then install the e2fsprogs utility using brew install e2fsprogs. With the MicroSD card attached to your computer, Read More …
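Those macOS steps come down to roughly the following in the terminal; /dev/disk2 is a placeholder for whatever identifier diskutil reports for your card, and formatting the wrong disk is destructive:

    # Install the ext4 tools via Homebrew
    brew install e2fsprogs

    # Find the MicroSD card's device identifier, then unmount it
    diskutil list
    diskutil unmountDisk /dev/disk2

    # Format the card with ext4 (mkfs.ext4 lives in e2fsprogs' sbin directory)
    sudo $(brew --prefix e2fsprogs)/sbin/mkfs.ext4 /dev/disk2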

Configuring the ODROID XU4 Operating System

UPDATE – This post was originally written for HardKernel's distribution of Ubuntu 15.10, but it has since been changed to use the ODROID server image for Ubuntu 14.04 LTS, which is available from HardKernel here. The motivation for this change was to use an officially supported HardKernel distribution that was specifically built for headless server applications. Ubuntu 14.04 LTS may be an older distribution, but it works for our purposes. UPDATE 2 – I have since rebuilt this cluster to use Ubuntu 16.04. You can find updated instructions here. Temporary Networking Setup: When setting up the nodes initially, you will need to SSH into them to configure their settings. However, if we go straight to our network design, we will not be able to connect to any node because the master node is not yet set up as a router. So we will need to connect each node directly to the external network (e.g., your home network). Since the bill of materials called Read More …
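Concretely, the temporary setup means finding each node's DHCP-assigned address on the home network and SSHing in; in the sketch below, the subnet and the login account are assumptions, since the image's default credentials are not given in this excerpt:

    # Scan the home network for the ODROID's DHCP-assigned address
    nmap -sn 192.168.1.0/24

    # SSH into the node to begin configuration (account depends on the image)
    ssh root@192.168.1.42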