Running the Word Count Job with Hadoop

Now that we have Hadoop up and running on the ODROID XU4 cluster, let’s take it for a spin. Every technical platform has a “Hello World” type project, and for Hadoop and other map-reduce platforms, it is Word Count. In fact, the Apache Hadoop MapReduce Tutorial uses Word Count to demonstrate how to write a Hadoop job. We are going to use that tutorial’s code. However, at the time of this writing, the WordCount example on the Apache Hadoop site has a bug in it. I’ve corrected that bug and posted the updated code to my GitHub repository for the ODROID XU4 cluster project.

Log into the master node as hduser. You need to update your environment variables so that you can easily build Hadoop jobs. Do this by editing the .bashrc file and adding the following at the end:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:${JAVA_HOME}/bin
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
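
After saving the file, reload it and do a quick sanity check that the toolchain is visible (a minimal check, assuming the Hadoop install location used earlier in this series):

source ~/.bashrc
echo $JAVA_HOME        # should print the path to your JDK install
hadoop version         # confirms the Hadoop binaries are on the PATH
ls $HADOOP_CLASSPATH   # tools.jar must exist for the build step below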

Then grab the corrected Word Count job code from the GitHub repository I set up and build the job jar. I am going to forgo explaining the code in this article. However, if you want to understand the code, the Apache Hadoop tutorial does a good job of explaining it.

cd ~/
git clone https://github.com/DIYBigData/odroid-xu4-cluster.git
cd odroid-xu4-cluster/hadoop/examples/WordCount/
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
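
If you want to confirm the jar was built correctly before submitting anything, list its contents; you should see WordCount.class along with its inner mapper and reducer classes:

jar tf wc.jar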

The next step is to grab some text files and load them into HDFS. For our test run, we will use the Blog Authorship Corpus [1], a collection of blog posts. Grab the data set, unpack it, then put the files into your user folder on HDFS. If you haven’t already, use start-dfs.sh to start HDFS on the cluster.

cd ~/
wget "http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip"
unzip blogs.zip
hdfs dfs -mkdir -p /user/michael/data/blogs
hdfs dfs -put blogs/* /user/michael/data/blogs

This will take a while. On my cluster it took 50 minutes 38 seconds. Watch the blinking lights on the network ports to keep yourself entertained.
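
Once the upload completes, you can sanity-check what actually landed on HDFS (a quick check; the -count output columns are directory count, file count, total bytes, and path):

hdfs dfs -count /user/michael/data/blogs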

You could also point your browser to port 50070 on the cluster’s external IP address and inspect the file system’s monitoring page. For my setup, the link was http://192.168.1.50:50070. If you look at the Datanodes tab of the HDFS status page, you will notice that the master node is getting a disproportionate number of blocks. The reason for this is that, given our setup, HDFS will place at least one copy of each file put into it on the same data node that the put operation was executed from. Since we have the cluster set up with a replication factor of two, each file block will have one copy on the master node and one on a randomly chosen slave node. Given the scale at which we are operating this cluster, and the fact that replication of two is enabled, this data imbalance is not a problem: when data needs to be accessed, it doesn’t have to be read from the master node. However, if you ever want to balance your cluster, you can use the following command:

hdfs balancer -threshold 3

This command will start moving blocks around on your cluster so that replicas are never on the same node and the percent utilization of each node is within 3% of the overall utilization of the cluster.
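
Before (or after) running the balancer, you can get a per-node usage breakdown from the NameNode to see how lopsided things are; a quick way to eyeball it (the exact report layout varies a little between Hadoop versions):

hdfs dfsadmin -report | grep -E "^(Name|Hostname|DFS Used%)"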

One thing to note about the data we are putting onto HDFS is that it is composed of 19320 files, some of them large and some small. The critical fact to know is that HDFS creates at least one block for each file placed onto it, even if the file is smaller than the block size. Recall that when we configured HDFS we set the block size to 16 MB. So even a 400-byte file gets a block of its own, plus its replicas and all of the per-block metadata and bookkeeping that entails. This is why uploading the blog corpus as-is takes a while: a very large number of block allocations is occurring.
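
If you want to confirm the block size the cluster is actually using and see how individual files map onto blocks, something like this works (a sketch; %b is the file size in bytes, %o the block size, and %r the replication factor):

hdfs getconf -confKey dfs.blocksize
hdfs dfs -stat "%b bytes, block size %o, replication %r: %n" /user/michael/data/blogs/* | head -5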

As an experiment, let’s see how long it would take to upload the same amount of data but as one large text file. To do that, we need to concatenate the blog corpus files into one file, and then put that on the file system.

cat blogs/* >> ~/blogs_all.xml
hdfs dfs -put blogs_all.xml /user/michael/data/

On my cluster, this took only 1 minute 44 seconds, over 28 times faster for the same amount of data! This discrepancy gets to the core of how to use Hadoop right. Hadoop is not intended to be used with a large number of small files. Instead, having a few large files will yield better performance when crunching data.
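
If you want to verify that the two copies of the corpus really do hold the same amount of data, a quick check:

hdfs dfs -du -s -h /user/michael/data/blogs /user/michael/data/blogs_all.xml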

Finally, let’s run the Word Count job against our data set. First, if YARN isn’t already running, start it with the start-yarn.sh command. Then run the Word Count job against the multi-file blog data set:

hadoop jar odroid-xu4-cluster/hadoop/examples/WordCount/wc.jar WordCount /user/michael/data/blogs /user/michael/jobs/wordcount/output-slow
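
While the job runs, you can monitor its progress from the command line or in the YARN ResourceManager web UI on port 8088 of the master node. A minimal sketch (the application ID below is a placeholder; substitute the ID that -list reports for your job):

yarn application -list -appStates RUNNING
yarn application -status application_XXXXXXXXXXXXX_0001   # placeholder ID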

Your cluster’s XU4 cooling fans should be whirring hard while the job is running. Again, this will take a while; on my cluster, it took 14+ hours to complete. Now, see how long it will take with the concatenated input file:

hadoop jar odroid-xu4-cluster/hadoop/examples/WordCount/wc.jar WordCount /user/michael/data/blogs_all.xml /user/michael/jobs/wordcount/output-fast

On my cluster, this took a little over 13 minutes, over 55 times faster. The reason for this performance difference is the number of blocks on HDFS that had to be read. To get a sense of how many blocks were involved, request a summary report from HDFS:

hdfs fsck /user/michael/data -files -blocks -locations | less

This command will show you information about the file system blocks used for every file and where they are located. At the end of the output, you should see a summary report that looks something like this:

Status: HEALTHY
 Total size:    1612397538 B
 Total dirs:    2
 Total files:   19321
 Total symlinks:                0
 Total blocks (validated):      19369 (avg. block size 83246 B)
 Minimally replicated blocks:   19369 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          4
 Number of racks:               1
FSCK ended at Sun Jul 03 01:04:04 UTC 2016 in 22051 milliseconds


The filesystem under path '/user/michael/data' is HEALTHY

Note the total number of blocks used for files under the /user/michael/data directory. Search through the report output for blogs_all.xml and you will see that it takes up 49 blocks. Subtracting that from the overall block count leaves 19320 blocks, the same as the number of individual blog files, used to hold the blog corpus. This difference in block count is the reason for the performance difference between the two runs of the same job over ostensibly the same data. Herein lies one of the most important lessons in big data performance optimization: the fewer filesystem blocks involved in an operation, the faster it will be. In more practical terms, a few large files will perform better than many small files. Granted, there is a sweet spot to the block count: adjusting the filesystem block size so that the entire dataset fits into a single block would yield no parallelism at all, since work units are aligned to blocks.
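
If you would rather not scan the full report, you can run fsck against just the concatenated file to see its block count directly:

hdfs fsck /user/michael/data/blogs_all.xml -files -blocks | grep "Total blocks"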

If you want to inspect the output from either job, just read the part file in the output directory. I like to pipe the file into less to make exploring the contents easier.

hdfs dfs -cat /user/michael/jobs/wordcount/output-fast/part-r-00000 | less
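
Since each line of the output is a tab-separated word and count, you can also pull out the most frequent words directly; a quick sketch (sorting the full output locally may take a little while):

hdfs dfs -cat /user/michael/jobs/wordcount/output-fast/part-r-00000 | sort -t$'\t' -k2,2 -rn | head -20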

You will note there is a lot of gibberish in the file. This is because we did not clean the input text. That’s OK; this article is focused on what it takes to run a map-reduce job, not on the usefulness of the data analysis being done.

In the next article, we will dig deeper into tweaking Hadoop parameters to yield the best performance possible.


References

[1] J. Schler, M. Koppel, S. Argamon, and J. Pennebaker (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs.
