July 2016 – DIY Big Data

Using the Quantcast File System with Spark on the ODROID XU4 Cluster

July 31, 2016 (updated December 2, 2017) · michael

I used to work at a company named Quantcast, which is known for processing petabyte-scale data sets. When operating at that scale, any inefficiency in your data infrastructure gets multiplied many times over. Early on, Quantcast found that the inefficiencies in Hadoop had a direct cost: more hardware was needed to reach a given level of performance. So Quantcast set out to write a highly efficient big data stack. At the bottom of that stack was Quantcast’s distributed file system, called QFS. Quantcast open-sourced QFS and performed several studies to show the efficiency gains QFS had over HDFS. Given my firsthand experience with QFS performance, I thought it would be interesting to get it running on the ODROID XU4 cluster in combination with Spark to see how much QFS would improve processing speeds in our setup. Spoiler alert: there were no time improvements, but QFS required less memory than HDFS, allowing me to assign more RAM to Spark. Installing Read More …

Installing Spark onto the ODROID XU4 Cluster

July 24, 2016 (updated January 21, 2017) · michael

While installing and using Apache Hadoop on the ODROID XU4 cluster is cool, my ultimate goal was to get the more modern Apache Spark application running for data analysis. In this post, we will not only install Spark, but also install the Jupyter Notebook for interacting with Spark conveniently from your computer’s browser. The installation instructions here will install Spark and run it as a standalone cluster in combination with the HDFS service we have previously installed. Installing Spark ends up being much simpler than installing Hadoop, but I found it to be more temperamental than Hadoop. Most notably, Spark is sensitive to the different CPU core speeds of the heterogeneous octa-core CPU that the XU4 uses. If the Spark processes run on the slower cores of the CPU, the master node thinks the slave processes are timing out (missing a heartbeat) and in turn restarts the executors on the slave nodes. This causes very inefficient repetition of work Read More …

What Every Data Scientist Should Know About Floating-Point Arithmetic

July 8, 2016 (updated August 3, 2022) · michael

In 1991, David Goldberg at Xerox PARC published the seminal paper on floating-point arithmetic, titled “What every computer scientist should know about floating-point arithmetic.” This paper was especially influential in the 1990s and early 2000s, when limitations in computer hardware drove people to operate in the regime that most exposes them to the limitations of floating-point arithmetic: using 32-bit floats for storing and calculating floating-point numbers. With modern 64-bit architectures and low storage and RAM costs, the incentive to use 32-bit floats is no longer there, and computer scientists generally use 64-bit double-precision floats, which greatly reduces their exposure to the issues of floating-point arithmetic. However, 64-bit floats do not remove the issues; they just kick the can down the road until usage patterns evolve to make floating-point issues prominent once again. With Big Data and extremely large data sets, both data scientists and computer scientists need to Read More …
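To make the 32-bit versus 64-bit point concrete, here is a minimal Python sketch. It is my own illustration, not code from Goldberg's paper or from the full post, and the helper names `to_float32` and `add_float32` are hypothetical. It uses the standard `struct` module to round values to 32-bit precision and shows that adding 1 to a large enough number has no effect: the threshold is 2**24 for 32-bit floats and 2**53 for 64-bit floats.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_float32(a: float, b: float) -> float:
    """Add two values as if they were stored and summed as 32-bit floats."""
    return to_float32(to_float32(a) + to_float32(b))


# A 32-bit float has a 24-bit significand, so not every integer above
# 2**24 is representable: adding 1.0 is silently absorbed.
big32 = float(2**24)
absorbed_32 = add_float32(big32, 1.0) == big32  # True

# The same sum is exact in 64-bit arithmetic...
exact_64 = (big32 + 1.0) == big32 + 1           # True

# ...but 64-bit floats hit the same wall at 2**53 (53-bit significand).
big64 = float(2**53)
absorbed_64 = (big64 + 1.0) == big64            # True
```

A counter or running sum over a very large data set can plausibly reach these magnitudes, which is where the wider format stops helping.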

Running the Word Count Job with Hadoop

July 2, 2016 (updated July 16, 2016) · michael

Now that we have Hadoop up and running on the ODROID XU4 cluster, let’s take it for a spin. Every technical platform has a “Hello World” type project, and for Hadoop and other map-reduce platforms, it is Word Count. In fact, the Apache Hadoop MapReduce Tutorial uses Word Count to demonstrate how to write a Hadoop job. We are going to use that tutorial’s code. However, at the time of this writing, the WordCount example on the Apache Hadoop site has a bug in it. I’ve corrected that bug and posted the updated code to my GitHub repository for the ODROID XU4 cluster project. Log into the master node as hduser. We need to update the environment variables so that you can easily build Hadoop jobs. Do this by editing the .bashrc file and adding the following at the end:

    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")..
    export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:${JAVA_HOME}/bin
    export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Then grab the corrected Word Count job code from the Read More …
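For readers new to the pattern, the logic of a Word Count job can be sketched in a few lines of plain Python. This is not the tutorial's Java code or the corrected version in the repository. It is a single-process illustration of the map, shuffle, and reduce phases, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, hypothetical choices.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["hello world", "hello hadoop"])
# counts == {"hello": 2, "world": 1, "hadoop": 1}
```

In a real Hadoop job the map and reduce steps run in parallel across the cluster's nodes and the framework performs the shuffle. The sketch only mirrors the data flow.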
