Connecting to HDFS from a computer external to the cluster

Since I set up my ODROID XU4 cluster to work with Spark from a Jupyter web notebook, one of the little annoyances has been how inefficient it was to transfer data into the HDFS file system on the cluster. It involved downloading data to my Mac laptop via a web download, transferring that data to the master node, then running hdfs dfs -put to place the data onto the file system. While I did set up an NFS mount on the cluster, it can be very slow when transferring large files to HDFS. So, my solution was to install HDFS on my Mac laptop and configure it to use the ODROID XU4 cluster.

One of the requirements for an external client (e.g., your Mac laptop) to interact with HDFS on a cluster is that the client must be able to connect to every node in the cluster. Since the master node was set up as a gateway between the cluster’s internal network and the external network the cluster is connected to, all you have to do is declare a route to the internal network through the master node gateway. On a Mac, you would use this command:

sudo route add 10.10.10.0/24 192.168.1.50

where the last IP address is the master node’s external IP address (the one bound to the USB Ethernet dongle). You will also need to update the /etc/hosts file on your client computer so that the nodes can be addressed by name. Add the following lines:

10.10.10.1      master
10.10.10.2      slave1
10.10.10.3      slave2
10.10.10.4      slave3
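
At this point it is worth a quick sanity check from the client (these are standard macOS tools; the host names assume the /etc/hosts entries above):

netstat -rn | grep 10.10.10
ping -c 1 master
ping -c 1 slave1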

Finally, install Hadoop onto your client computer and grab the configuration from the XU4 cluster:

cd /path/to/where/you/want/hadoop
wget http://apache.claz.org/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar xvzf hadoop-2.7.2.tar.gz
cd hadoop-2.7.2/etc/hadoop
scp hduser@master:/usr/local/hadoop/etc/hadoop/hdfs-site.xml .
scp hduser@master:/usr/local/hadoop/etc/hadoop/masters .
scp hduser@master:/usr/local/hadoop/etc/hadoop/slaves .
scp hduser@master:/usr/local/hadoop/etc/hadoop/core-site.xml .
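
It is also worth confirming that the copied configuration actually points at the cluster. For example, the fs.defaultFS property in core-site.xml should reference the master node (the exact port depends on how your cluster was configured):

grep -A 1 fs.defaultFS core-site.xml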

Now add /path/to/where/you/want/hadoop/hadoop-2.7.2/bin to your PATH environment variable (of course, using the path where you actually installed Hadoop).
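
For example, assuming a bash login shell (adjust the file and the path to your setup):

echo 'export PATH="$PATH:/path/to/where/you/want/hadoop/hadoop-2.7.2/bin"' >> ~/.bash_profile
source ~/.bash_profile
hdfs version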

To use the hdfs command from your client computer’s terminal, note that you have to tell Hadoop which user to perform the command as on the cluster. Since the XU4 cluster was set up to have everything operate through the user hduser, we would use a command like this to put a file onto the XU4 cluster:

HADOOP_USER_NAME=hduser hdfs dfs -put ./my_file.txt /user/michael/data/my_file.txt

You can always declare the HADOOP_USER_NAME  environment variable in your shell configuration file to save a little typing in the Terminal.
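
For example, something like this (again assuming a bash login shell) lets you drop the prefix from every command:

echo 'export HADOOP_USER_NAME=hduser' >> ~/.bash_profile
source ~/.bash_profile
hdfs dfs -ls /user/michael/data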
