Now is the time when we start to see the fruits of our labor in building the ODROID XU4 low-cost cluster. We will install Hadoop and configure it to serve an NFS mount that can be mounted on a client computer (e.g., your laptop), letting you interact with the HDFS file system as if it were another drive on your machine. This feature will greatly ease the use of our cluster, as it minimizes the need for a user to log into the cluster to use it. An NFS mount is not the only piece needed to realize this client-usage vision, but it is an important one.
Before we install Hadoop, let’s discuss what we are trying to accomplish by installing it. Hadoop has three components: the Hadoop Distributed File System (HDFS), YARN, and Map-Reduce. For our purposes, we are most interested in HDFS, but we will play around with the other two. What HDFS will do for us is turn the MicroSD cards installed on every node into a single “virtual drive” that files can be written to for use by the cluster’s analytics applications. HDFS will (if appropriate) break a large file up into blocks and then distribute those blocks around the cluster. The benefit of this is that when doing distributed computing operations on the file, the work will be split up, with each node responsible for processing the blocks it holds.
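As a back-of-the-envelope illustration of the block splitting, the number of blocks HDFS creates for a file is simply the file size divided by the block size, rounded up. The 128 MiB block size below is the Hadoop 2.x default, and the 500 MiB file is hypothetical; your configuration may use different values:

```shell
# Ceiling division: number of HDFS blocks for a given file size.
# 128 MiB is the Hadoop 2.x default block size; adjust to match your config.
block_size=$((128 * 1024 * 1024))
file_size=$((500 * 1024 * 1024))   # a hypothetical 500 MiB input file
num_blocks=$(( (file_size + block_size - 1) / block_size ))
echo "$num_blocks"   # prints 4: blocks spread across the cluster's DataNodes
```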
Installing Java
Hadoop is written in Java, so the first thing we need to do is install Java on all nodes in the cluster. Run the following commands on each node in the cluster to install Oracle’s latest version of Java 8. Note that there will be some interactive steps in the process requiring you to accept Oracle’s license. Also, install the useful utility rsync.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install rsync
Getting the Hadoop Software
Hadoop can be downloaded from the Apache Hadoop site. However, the version you will download there has some issues with running on our ODROID XU4 cluster. Specifically, it contains native libraries built for the x86 processor which were built to make certain operations run faster than they would in Hadoop’s main implementation language of Java. The x86 build of Hadoop will work in that Hadoop will automatically fall back to the slower Java implementation. But, we want our installation to be as fast as possible, so Hadoop needs to be built for the XU4’s ARM processor with hard float capabilities.
I am not going to document the Hadoop build process here. There is a good article describing how to build Hadoop on a Raspberry Pi here. The process for the XU4 is pretty much identical. Build Hadoop on the master node. We will rsync it to the rest of the cluster after everything is configured.
If you want to skip the build process, you can download my pre-built package from my GitHub repository and place the file into the /opt directory on the master node. Alternatively, you can download the Hadoop 2.7.2 build directly to the master node:

cd /opt
sudo wget http://diybigdata.net/downloads/hadoop/hadoop-2.7.2.armhf.tar.gz
Preparing the Cluster for Hadoop
Our first step in installing Hadoop is to set up the hduser account on all our nodes and enable passwordless logins from the master node to all slaves. While logged in as user odroid on the master node:
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "addgroup hadoop"
Then you will need to do the following logged into each node one by one, as parallel-ssh doesn’t handle the interactive nature of creating a user. Note that you will be asked to create a password for hduser. I simply used the same one that the odroid account uses.
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
Then, back on the master node, give hduser super user rights and distribute the master node’s SSH keys:
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "usermod -aG sudo hduser"
su hduser
cd
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
ssh hduser@localhost
exit
ssh hduser@master
exit
ssh-copy-id hduser@slave1
ssh hduser@slave1
exit
ssh-copy-id hduser@slave2
ssh hduser@slave2
exit
ssh-copy-id hduser@slave3
ssh hduser@slave3
exit
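The repetitive ssh-copy-id/log-in/exit sequence for the slaves can equivalently be written as a loop (using the same slave1 through slave3 hostnames):

```shell
# Copy the hduser public key to each slave, then log in once so the
# host key is accepted and passwordless access is verified.
for host in slave1 slave2 slave3; do
    ssh-copy-id "hduser@${host}"
    ssh "hduser@${host}" exit
done
```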
Finally, let’s create the Hadoop data directory in our /data mount. Under the odroid user, issue:
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "mkdir -p /data/hdfs/tmp"
parallel-ssh -i -h ~odroid/cluster/all.txt -l root "chown -R hduser:hadoop /data/hdfs"
Installing and Configuring Hadoop
Unpack the Hadoop package in the /opt directory. I also like creating a symlink to the install from the /usr/local directory. On the master node:
cd /opt
sudo tar xzf hadoop-2.7.2.armhf.tar.gz
sudo chown -R hduser:hadoop hadoop-2.7.2
cd /usr/local
sudo ln -s /opt/hadoop-2.7.2 hadoop
The next step is to configure the Hadoop installation on the master node. This requires editing several configuration files found in the /usr/local/hadoop/etc/hadoop directory. I’ve posted the contents of these files to the GitHub repository for the ODROID XU4 cluster project.
The first configuration file is hadoop-env.sh:
vi etc/hadoop/hadoop-env.sh
Adjust the following two lines to match what is shown:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HEAPSIZE=384
These changes tell Hadoop where it can find the Java libraries and set the default heap size. Since our devices have 2 GB of RAM and we want to save as much RAM as possible for an Apache Spark install later, we are limiting Hadoop’s heap to 384 MB.
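To see how the JAVA_HOME expression works, you can run the pipeline by hand with an illustrative path (the actual path on your nodes depends on where the Oracle installer placed Java):

```shell
# readlink -f resolves /usr/bin/java through its symlink chain to the real
# binary; sed then strips the trailing "bin/java" to leave the Java home.
java_path="/usr/lib/jvm/java-8-oracle/jre/bin/java"   # illustrative resolved path
echo "$java_path" | sed "s:bin/java::"
# prints: /usr/lib/jvm/java-8-oracle/jre/
```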
Verify that Hadoop’s environment is (minimally) set up:
bin/hadoop
The next configuration file, core-site.xml, sets up the core variables used across all of Hadoop’s components:
vi etc/hadoop/core-site.xml
Add the following values between the <configuration></configuration> tags.
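The authoritative values are in the project’s GitHub repository and are not reproduced here. As a sketch, a minimal core-site.xml for this cluster would at least point Hadoop at the master’s NameNode and at the /data/hdfs/tmp directory created earlier (port 9000 is a common convention, not taken from the original post):

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value> <!-- the NameNode runs on the master node -->
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hdfs/tmp</value> <!-- directory created on every node above -->
</property>
```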
To configure the HDFS component, edit hdfs-site.xml:
vi etc/hadoop/hdfs-site.xml
Add the following values between the <configuration></configuration> tags.
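Again, the exact values are in the repository. As an illustrative sketch, the kind of property that belongs here is the block replication factor (the value of 3 is an assumption for a four-node cluster, not taken from the original post):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- illustrative: copies of each block kept across the nodes -->
</property>
```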
To configure the Map-Reduce component, edit mapred-site.xml:
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
vi etc/hadoop/mapred-site.xml
Add the following values between the <configuration></configuration> tags. Note that many of these configuration items control the memory allocated during a map-reduce job.
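The repository holds the real values. A sketch of typical entries for memory-constrained 2 GB nodes might look like this (the property names are standard Hadoop 2.x; the megabyte values are illustrative assumptions):

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value> <!-- run map-reduce jobs on YARN -->
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value> <!-- illustrative: small map containers for 2 GB nodes -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>256</value> <!-- illustrative: small reduce containers -->
</property>
```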
To configure the resource manager YARN, edit yarn-site.xml:
vi etc/hadoop/yarn-site.xml
Add the following values between the <configuration></configuration> tags. Note that many of these configuration items control the memory allocated during a map-reduce job.
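As with the other files, the committed values live in the repository. A sketch of the kind of properties involved (the aux-services entry is the standard Hadoop 2.x requirement for map-reduce on YARN; the memory figure is an illustrative assumption for a 2 GB node):

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value> <!-- required for map-reduce on YARN -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1024</value> <!-- illustrative: RAM YARN may allocate per 2 GB node -->
</property>
```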
Create and edit the masters and slaves files:
vi etc/hadoop/masters
Set the masters file contents to:
master
Now open the slaves file for editing:
vi etc/hadoop/slaves
Set the slaves file contents to:
master
slave1
slave2
slave3
Installing Configured Hadoop to All Slaves
Go back to the odroid account and use rsync to push the configured Hadoop installation to each slave.
exit
parallel-ssh -i -h ~/cluster/slaves.txt -l root "mkdir -p /opt/hadoop-2.7.2/"
sudo rsync -avxP /opt/hadoop-2.7.2/ root@slave1:/opt/hadoop-2.7.2/
sudo rsync -avxP /opt/hadoop-2.7.2/ root@slave2:/opt/hadoop-2.7.2/
sudo rsync -avxP /opt/hadoop-2.7.2/ root@slave3:/opt/hadoop-2.7.2/
parallel-ssh -i -h ~/cluster/slaves.txt -l root "chown -R hduser:hadoop /opt/hadoop-2.7.2/"
parallel-ssh -i -h ~/cluster/slaves.txt -l root "ln -s /opt/hadoop-2.7.2 /usr/local/hadoop"
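A quick way to confirm the copy and the symlink worked everywhere is to ask each slave for its Hadoop version (this assumes the same slaves.txt host list used above; every node should report the same version string as the master):

```shell
# Run "hadoop version" through the symlinked install path on every slave.
parallel-ssh -i -h ~/cluster/slaves.txt -l hduser "/usr/local/hadoop/bin/hadoop version"
```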
Install the utilities needed to interact with the HDFS NFS mount:
sudo apt-get install nfs-common
Finally, update the master node’s .bashrc file for the hduser user to add Hadoop to the PATH:
su hduser
vi ~/.bashrc
And add these lines at the end:
export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin
Starting and Stopping HDFS
Before you start HDFS for the first time, you need to format the HDFS NameNode. Ensure that you are logged into the master node as hduser for these commands.
hdfs namenode -format
Then to start HDFS and the NFS server:
start-dfs.sh
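Once start-dfs.sh returns, you can sanity-check which Java daemons came up with jps (the daemon names below are the standard Hadoop 2.x ones; the exact list depends on your configuration):

```shell
# On the master you would expect to see something like:
#   NameNode, SecondaryNameNode, DataNode (the master is also a slave here)
# On each slave: DataNode only.
jps
```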
The Hadoop convention is that your files live in a directory under /user named after your account. Use this command to create it, substituting your own username:
hdfs dfs -mkdir -p /user/michael
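To verify the new directory and see the round trip in action, you can copy a local file into HDFS and list it back (the username and file name here are placeholders):

```shell
# Create a small local file, copy it into the HDFS home directory, list it.
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/michael/
hdfs dfs -ls /user/michael
```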
To stop HDFS:
stop-dfs.sh
That’s it. You now have Hadoop installed and configured on your ODROID XU4 cluster. In the next post, we will explore how to connect to HDFS and what we can do with Hadoop.