After setting up the Hadoop installation on the ODROID XU4 cluster, we need a way to get data in and out of it. The traditional pattern when a cluster is on its own network, as ours is, is to have an edge node: the user logs into the edge node, transfers the data to it, and then puts that data into HDFS from there. Speaking from experience, this is annoyingly too much work. For my personal cluster, I want the HDFS file system to integrate with my Mac laptop.
The most robust way to accomplish my goal is to mount HDFS as an NFS drive. The Hadoop distribution we are using has an NFS server built in. This server runs on the master node, effectively acting as a proxy between the HDFS cluster and the external network. The pro of this approach is that I get the usage paradigm I want. The con is that it can lead to an unbalanced cluster. The reason is that when a file is put onto HDFS, most of its blocks are almost always placed on the node that the client connected to. In a network layout where the clients (my laptop, in this case) use the HDFS software directly and can see and connect to all the nodes in the cluster, the node connected to is effectively random, which in turn leads to a balanced cluster over time. However, the HDFS NFS server always connects to the node it is running on; in our setup, that is the master node. The Hadoop package does provide a solution to this: a cluster rebalancing process that over time moves file blocks around with the goal of keeping the cluster balanced.
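If the cluster does become lopsided after heavy uploads through the NFS gateway, the rebalancer can also be kicked off by hand on the master node. A minimal sketch (the threshold value here is just an illustrative choice, not a recommendation from this setup):

```shell
# Run the HDFS balancer until every DataNode's disk utilization is within
# 10 percentage points of the cluster average. Run as hduser on the master;
# the process logs its progress and exits when the cluster is balanced.
hdfs balancer -threshold 10
```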
Configuring NFS Service
We originally set up our HDFS system to use simple permissions and security. What this means is that files are created with the user and group of the client process that put the file on the file system. This works great when you are interacting with HDFS from the master node. However, if you are interacting with HDFS via the NFS mount on your laptop, the user and group IDs likely do not match any users and groups on the master node. This mismatch will cause HDFS to throw a lot of permission errors, which in turn prevents you from using HDFS from your NFS mount.
There is a solution to this problem: create a user and group mapping file on the master node which tells the HDFS NFS service how to map the users and groups it may see on the client to the users and groups that HDFS knows about. Some of the work for doing this is in the various configuration files for HDFS. If you installed the configuration files as described in the HDFS setup instructions, you should be good to go on that front.
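For reference, the NFS-gateway-related settings look something like the fragment below in hdfs-site.xml. The property names come from the Hadoop NFS gateway documentation (older releases prefix some of them with dfs.nfs.); the values shown are illustrative and should already match what the earlier setup post installed:

```xml
<!-- hdfs-site.xml: NFS gateway settings (a sketch; adjust to your setup) -->
<property>
  <name>static.id.mapping.file</name>
  <value>/etc/nfs.map</value>  <!-- the client-to-HDFS user/group mapping file -->
</property>
<property>
  <name>nfs.exports.allowed.hosts</name>
  <value>* rw</value>  <!-- which client hosts may mount, and with what access -->
</property>
```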
This then leaves setting up the user and group mapping. The first step in doing this is to get the account UID and GID values for the account(s) on your computer(s) from which you will mount the NFS drive. I am going to give instructions for doing this on Mac OS X, as that is what I have. If you have a different type of computer, it should be easy to find instructions on the internet.
To get the UID for each account on Mac OS X, type the following into the Terminal:
dscl . -list /Users UniqueID
You will get the complete list of accounts on your Mac and their corresponding UID. Record the UID of the account(s) from which you intend to connect to HDFS’s NFS mount. Now find the primary GID for the user(s) with this command:
dscl . -list /Users PrimaryGroupID
Again, record the GID for the desired account(s). Note that for most Mac OS X setups, the primary group for user accounts is staff, which has a GID of 20. However, do check your setup with the command above.
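If you only need the values for the account you are currently logged into, the POSIX id command is quicker than listing every account on the machine:

```shell
# Print only the current account's UID and primary GID.
# On a typical single-user Mac these are 501 and 20 (staff).
id -u
id -g
```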
Now that you have the UID(s) and GID(s) of the client accounts you intend to use, log into the master node in the ODROID XU4 cluster. Your first task is to find the UID and GID of the hduser account on the master node where the NFS mount will be proxied. To find the UID and then the GID:
id -u hduser
id -g hduser
If you set up the cluster as I did, both the UID and primary GID of hduser will be 1001. Now, create and edit the /etc/nfs.map file:
sudo vi /etc/nfs.map
Adjust the contents to look like this for each client account:
uid 501 1001 # mac michael user to hduser user
gid 20 1001 # mac staff group to hadoop group
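A typo in this file is easy to make and hard to diagnose later, so a quick syntax check can save time. The sketch below uses a sample file so the commands can be tried anywhere; on the master node, point MAPFILE at /etc/nfs.map instead:

```shell
# Sanity-check the mapping file syntax before starting the NFS service.
MAPFILE=/tmp/nfs.map.sample
printf 'uid 501 1001 # mac michael user to hduser user\ngid 20 1001 # mac staff group to hadoop group\n' > "$MAPFILE"

# Every non-blank, non-comment line must look like: uid|gid <client-id> <hdfs-id>
if grep -Ev '^[[:space:]]*(#|$)' "$MAPFILE" | grep -Eqv '^(uid|gid) [0-9]+ [0-9]+'; then
  echo "malformed mapping lines"
else
  echo "mapping file format OK"
fi
```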
No additional configuration is needed because we already configured HDFS for this in the prior post.
Starting NFS Service
Now you are ready to start the NFS service. The order of the following commands is important, and they should be issued as the hduser account:
service rpcbind stop
start-dfs.sh
sudo -u root /usr/local/hadoop/sbin/hadoop-daemon.sh --script /usr/local/hadoop/bin/hdfs start portmap
hadoop-daemon.sh --script /usr/local/hadoop/bin/hdfs start nfs3
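To confirm the gateway registered correctly before trying to mount anything, the standard RPC inspection tools can be run on the master node (or from the Mac, replacing localhost with the master's IP); this mirrors the verification steps the Hadoop NFS gateway documentation suggests:

```shell
# Confirm the RPC services the NFS client will look for are registered.
rpcinfo -p localhost    # should list portmapper, mountd, and nfs entries
showmount -e localhost  # should show "/" in the export list
```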
Connecting to the HDFS NFS Mount
Back on your Mac OS X machine, we need to create a mount point and mount the HDFS NFS drive. You are free to name the mount location on your Mac whatever you like; replace the IP address in the second command below with the address that your network router assigned to the master node’s ethernet dongle.
sudo mkdir -p /mnt/mini-cluster
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync 192.168.1.50:/ /mnt/mini-cluster
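A quick smoke test from the Terminal will tell you whether the mount actually went through (paths here assume the mount point chosen above):

```shell
# The mount should report the cluster's capacity, not the local disk's.
df -h /mnt/mini-cluster
# HDFS directories appear as ordinary folders under the mount.
ls -l /mnt/mini-cluster/user
```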
That’s it. The HDFS volume on your ODROID cluster is now mounted on your Mac. You can interact with the mount either from the Terminal or in the Finder. Note that the NFS mount does not have all the features of a local Mac OS X drive; most notably, you cannot do random reads and writes to the drive. But you can certainly copy files to HDFS by dragging them in the Finder from your Mac to a folder on the mount. In the Mac OS X Finder, navigate to the user folder you created on the HDFS mount, and create a folder called text-files.
Find and download some large text files from the internet. You can find some at these locations:
- Project Gutenberg
- Donald Trump’s Speech Corpus
- All the text messages pertaining to the 9/11 attack in NYC
- Full text of the US Federal Laws
Download some of these text files, then copy them to the /user/michael/text-files folder on your HDFS mount (of course, replace “michael” with the name you gave your user folder). The goal here is to have about 100MB of text data across 2 or more text files in this directory. We will play with this text data in the next post, where we walk through the “Hello World” of Map/Reduce: the word count program.
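If you prefer the Terminal to the Finder for the copy, the mount behaves like any other folder for sequential writes. The paths below assume the mount point and user folder names used earlier in this post:

```shell
# Copy downloaded text files straight into HDFS through the NFS mount.
cp ~/Downloads/*.txt /mnt/mini-cluster/user/michael/text-files/
# Check the total size; the goal is roughly 100MB of text.
du -sh /mnt/mini-cluster/user/michael/text-files
```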
Shutting Down NFS Service
If you wish to bring the NFS service and the HDFS system down, issue the following commands:
hadoop-daemon.sh --script /usr/local/hadoop/bin/hdfs stop nfs3
sudo -u root /usr/local/hadoop/sbin/hadoop-daemon.sh --script /usr/local/hadoop/bin/hdfs stop portmap
stop-dfs.sh