Mounting HDFS with NFS

After setting up the Hadoop installation on the ODROID XU4 cluster, we need to find a way to get data in and out of it. The traditional pattern used when a cluster is on its own network, as ours is, is to have an edge node: the user logs into the edge node, transfers the data to it, and then puts that data into HDFS from there. Speaking from experience, this is annoyingly too much work. For my personal cluster, I want the HDFS file system to integrate with my Mac laptop. The most robust way to accomplish my goal with HDFS is to have it mounted as an NFS drive. The Hadoop distribution we are using has an NFS server built in. This server is run on the master node, effectively acting as a proxy between the HDFS cluster and the external network. The pro of this approach is that I get the usage paradigm that I want. Read More …

Installing Hadoop onto an ODROID XU4 Cluster

Now is the time when we start to see the fruits of our labor in getting the ODROID XU4 low cost cluster built. We will be installing Hadoop and configuring it to serve an NFS mount that can be mounted on your client computer (e.g., your laptop), so that you can interact with the HDFS file system as if it were another hard drive on your computer. This feature will greatly ease the use of our cluster, as it will minimize the need for a user to log into the cluster to use it. An NFS mount is not the only facet of the cluster needed to enable the client usage vision, but it is an important one. Before we install Hadoop, let’s discuss what we are trying to accomplish by installing it. Hadoop has three components: the Hadoop Distributed File System (HDFS), YARN, and MapReduce. For our purposes, we are most interested in HDFS, but we will play around with the Read More …

Adding the MicroSD Data Drives to the ODROID XU4 Cluster

We are nearly done with the basic setup of the ODROID XU4 four node cluster. The final step is to configure the MicroSD cards that were in the bill of materials as data drives for each node. Before we start, ensure that the cluster is completely powered down and remove the power cords from each node. We will be removing and adding components to the ODROID XU4s, and it is best practice to have no power to the device when doing so. The first step is to format each MicroSD card with the ext4 filesystem. Each platform has its own instructions for how to do this. I will provide the Mac OS X instructions since that is what I use. In the terminal, do the following: If you don’t have it already, install the Homebrew package management app for OS X. Visit the brew.sh website for instructions. Install the e2fsprogs utility using brew install e2fsprogs. With the MicroSD card attached to your computer, Read More …

Configuring DHCP and NAT in ODROID XU4 Cluster

UPDATE – I have rebuilt this cluster to use Ubuntu 16.04. You can find updated instructions for Ubuntu 16.04 here. As was discussed in the network design post, we will set up the master node as a router to manage network traffic in and out of the cluster. Before starting, ensure that all of the slave nodes have been powered down, that your home network is still connected directly to the open port on the cluster’s ethernet switch, that you have collected each node’s MAC address, and that the master node is powered up and you are logged into it via SSH. The first step is to explicitly set up the networking interfaces for both the eth0 and eth1 devices on the master node. Note that by default the Ubuntu system that was installed on the master node treats eth0 as requesting a DHCP lease on the network it is attached to. This is why it got an IP Read More …

Configuring the ODROID XU4 Operating System

UPDATE – This post was originally written for HardKernel’s distribution of Ubuntu 15.10, but has now been changed to use the ODROID server image for Ubuntu 14.04 LTS, which is available from HardKernel here. The motivation for this change was to use an officially supported HardKernel distribution that was specifically built for headless server applications. Ubuntu 14.04 LTS may be an older distribution, but it works for our purposes. UPDATE 2 – I have since rebuilt this cluster to use Ubuntu 16.04. You can find updated instructions here. Temporary Networking Setup: When setting up the nodes initially, you will need to SSH into them to configure their settings. However, if we go straight to our network design, we will not be able to connect to any node because the master node is not yet set up as a router. So we will need to connect each node directly to the external network (e.g., your home network). Since the bill of materials called Read More …

Network Design for the Low Cost Cluster

The first task in building any cluster is to design how it will be set up, most notably how the nodes will interact with each other. The cluster we will be building will have 4 nodes: one master node and three slaves. The nodes will be connected to each other by the ethernet switch. However, we want the node-to-node communication to be on its own network. This maximizes the throughput in the node-to-node communication, which is important for distributed computation, and also makes the cluster behave more like a single device to the external network. The benefit of doing this is that we can add nodes to or remove nodes from the cluster without a client ever knowing. However, this approach does present some challenges with how an external client (e.g., your laptop) will interact with the data analysis software, such as Hadoop, but we will deal with that later. My goal is really to create a “data analysis appliance”, so requiring any client Read More …

Building the ODROID XU4 Low Cost Cluster

My shipment of ODROID XU4s that I ordered for my low cost cluster came in today, so I set out to assemble the cluster. My daughter “helped” me with this, making it a fun family activity. The first thing I did was lay out the nodes. Though I knew these boards would be small, it did impress me just how small they were. The first task is to attach the eMMC drive to each node. Note that I eventually had to undo this, as I later discovered I needed to flash the drives with the latest Ubuntu build. But given the amount of space the PCB standoffs give, removing and reattaching the eMMC drive is not hard. Then I attached the XU4s together with the PCB standoffs I bought. The short standoffs serve as the “feet” of the stack, and the long ones separate each node. Once I stacked all of the nodes, I attached the ethernet switch and the cabling. I Read More …

Hardware Selection for a Low Cost Cluster

For this first phase of my project, I will be building a cluster on the cheap. My cost goal was $600 for a 4 node system, or $150 per node. The Raspberry Pi immediately comes to mind as an option. Indeed, there have been several projects where people turned a set of Raspberry Pis into a cluster. However, since my goal is to create a cluster for Spark, the size of the aggregate RAM pool of the cluster is important. The Raspberry Pi Model 3 only has 1 GB of RAM, so I explored whether there are other options with more RAM. I found the ODROID-XU4 single board computer, which has 2 GB of RAM and an 8-core ARM CPU. Furthermore, each board is only $76 without storage. It also has some other nice features, such as USB 3 ports, HDMI output, an eMMC 5.0 hard drive connector, a MicroSD slot supporting the faster UHS-1 standard, onboard gigabit ethernet, and a serial console port. One consideration is data storage. Read More …
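The budget and RAM comparison in the excerpt above can be checked with a few lines of arithmetic. This is only a rough sketch using the figures quoted in the post ($600 total, 4 nodes, $76 per XU4 board, 2 GB versus 1 GB of RAM); the constant and function names are made up for illustration, and the "left over" figure is simply the per-node budget minus the board price, not an actual bill of materials.

```python
# Figures quoted in the post; all names here are illustrative assumptions.
BUDGET_TOTAL_USD = 600   # cost goal for the whole cluster
NODE_COUNT = 4
XU4_BOARD_USD = 76       # price per ODROID-XU4 board, without storage
XU4_RAM_GB = 2
PI3_RAM_GB = 1


def per_node_budget(total_usd, nodes):
    """Split the total budget evenly across the nodes."""
    return total_usd / nodes


def aggregate_ram_gb(ram_per_node_gb, nodes):
    """Total RAM pool across the cluster."""
    return ram_per_node_gb * nodes


budget_per_node = per_node_budget(BUDGET_TOTAL_USD, NODE_COUNT)
# What remains per node for storage, power, and networking after the board.
left_per_node = budget_per_node - XU4_BOARD_USD

print(f"Budget per node: ${budget_per_node:.0f}")
print(f"Left per node after the board: ${left_per_node:.0f}")
print(f"Aggregate RAM, XU4 cluster: {aggregate_ram_gb(XU4_RAM_GB, NODE_COUNT)} GB")
print(f"Aggregate RAM, Pi 3 cluster: {aggregate_ram_gb(PI3_RAM_GB, NODE_COUNT)} GB")
```

With the same node count, the XU4 cluster has twice the aggregate RAM of a Raspberry Pi 3 cluster, which is the property the post cares about for Spark.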

DIY Big Data Project Goal

I have always been interested in large scale computing. This interest started back in graduate school when I was studying astrodynamics. Most of the problems you want to solve in astrodynamics require numerical computation, as only the 2-body problem has a closed form solution. In order to solve anything more complex, you first have to make some simplifying assumptions (e.g., no other gravitational influences), and then you have a set of differential equations that describe the motion of your system. The only way to use these equations to predict the position of the bodies is through numerical integration. In simple systems this sort of calculation is pretty straightforward, though you have to pay a lot of attention to round-off error. However, if you are trying to predict the progression of a debris field in space, which would require you to simultaneously project the motion of tens of thousands of objects, parallelized computation starts to look attractive. Compute Read More …