Since I set up my ODROID XU4 cluster to work with Spark from a Jupyter web notebook, one of the little annoyances I have had is how inefficient it was to transfer data into the HDFS file system on the cluster. It involved downloading data to my Mac laptop via a web download, transferring that data to the master node, and then running hdfs dfs -put
to place the data onto the file system. While I did set up my cluster to create an NFS mount, it can be very slow when transferring large files to HDFS. So, my solution to this was to install HDFS on my Mac laptop, and configure it to use the ODROID XU4 cluster.
One of the requirements for an external client (e.g., your Mac laptop) to interact with HDFS on a cluster is that the client must be able to connect to every node in the cluster. Since the master node of the cluster was set up as a gateway between the cluster’s internal network and the external network it is connected to, all you have to do is declare a route to the internal network through the master node gateway. On a Mac, you would use this command:
sudo route add 10.10.10.0/24 192.168.1.50
Here, the last IP address is the master node’s external IP address (the one bound to the USB Ethernet dongle). You will also need to update the /etc/hosts
file on your client computer so that you can address the nodes by name. Add the following lines:
10.10.10.1 master
10.10.10.2 slave1
10.10.10.3 slave2
10.10.10.4 slave3
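With the route and hosts entries in place, a quick sanity check is to ping a couple of the nodes by name from the Mac; you should see replies from the internal 10.10.10.x addresses:
# replies should come back from 10.10.10.1 and 10.10.10.2
ping -c 3 master
ping -c 3 slave1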
Finally, install Hadoop onto your client computer and grab the configuration from the XU4 cluster:
cd /path/to/where/you/want/hadoop
wget http://apache.claz.org/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar xvzf hadoop-2.7.2.tar.gz
cd hadoop-2.7.2/etc/hadoop
scp hduser@master:/usr/local/hadoop/etc/hadoop/hdfs-site.xml .
scp hduser@master:/usr/local/hadoop/etc/hadoop/masters .
scp hduser@master:/usr/local/hadoop/etc/hadoop/slaves .
scp hduser@master:/usr/local/hadoop/etc/hadoop/core-site.xml .
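Of the files copied above, core-site.xml is the one that tells the client where the cluster’s NameNode lives, via the fs.defaultFS property (older 2.x configurations may use fs.default.name instead). If you want to double-check that the copy you pulled down points at the master node, a quick look from the same directory could be:
# prints the fs.defaultFS (or fs.default.name) property and its hdfs://master:... value
grep -A 1 'fs.default' core-site.xml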
Now add /path/to/where/you/want/hadoop/hadoop-2.7.2/bin
to your PATH
environment variable (of course, use the correct path to wherever you installed Hadoop).
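For example, on a Mac you might append a line like this to your ~/.bash_profile (or whichever shell configuration file you use), then open a new Terminal window; the path shown is just the placeholder from above:
# make the hdfs and hadoop commands available in every new shell
export PATH="$PATH:/path/to/where/you/want/hadoop/hadoop-2.7.2/bin"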
To use HDFS commands from your client computer’s terminal, note that you have to tell Hadoop which user to run the command as on the cluster. Since the XU4 cluster was set up to have everything operate through the user hduser
, we would use a command like this to put a file onto the XU4 cluster:
HADOOP_USER_NAME=hduser hdfs dfs -put ./my_file.txt /user/michael/data/my_file.txt
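If you want to confirm the file made it over, the same pattern works for listing the destination directory on the cluster:
HADOOP_USER_NAME=hduser hdfs dfs -ls /user/michael/data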
You can always declare the HADOOP_USER_NAME
environment variable in your shell configuration file to save a little typing in the Terminal.
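For instance, adding a line like this to that same shell configuration file lets you drop the HADOOP_USER_NAME= prefix from the hdfs commands above:
# run HDFS commands against the cluster as hduser by default
export HADOOP_USER_NAME=hduser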