michael – Page 3 – DIY Big Data

Connecting to HDFS from an computer external to the cluster

August 13, 2016 michael

Since I have set up my ODROID XU4 cluster to work with Spark from a Jupyter web notebook, one of the little annoyances I have had is how inefficient it was to transferring data into the HDFS file system on the cluster. It involved downloading data to my Mac laptop via a web download, transferring that data to the master node, then running hdfs dfs -put to place the data onto the file system. While I did set up my cluster to create an NFS mount, it can be very slow when transferring large files to HDFS. So, my solution to this was to install HDFS on my Mac laptop, and configure it to use the ODROID XU4 cluster. One of the requirements for an external client (e.g., your Mac laptop) to interact with HDFS on a cluster is that the external client must be able to connect to every node in the cluster. Since the master node of the cluster was Read More …

Coverting Apache Access Logs to Parquet Backed Data Frames on Spark

August 7, 2016 michael

One of the analysis I am looking to do with my ODROID XU4 cluster is to take a look at various access patterns on my website by analyzing the Apache http access logs. Analyzing Apache access logs directly in Spark can be slow due to them being unstructured text logs. Converting to the logs to a data frame backed by partitioned parquet files can make subsequent analysis much faster. The first task is to create a mapper that can be used in Spark convert a row int eh access log to a Spark Row object. A Python 3 mapper would look like: # Parse an Apache access log. Assumes Python 3 import re from pyspark.sql import Row from datetime import datetime APACHE_ACCESS_LOG_PATTERN = ‘^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] “(\S+) (\S+) (\S+)” (\d{3}) (\d+) “((?:[^”]|”)+)” “((?:[^”]|”)+)”$’ DATETIME_PARSE_PATTERN = ‘%d/%b/%Y:%H:%M:%S %z’ # Returns a Row containing the Apache Access Log info def parse_apache_log_line(logline): match = re.search(APACHE_ACCESS_LOG_PATTERN, logline) if match is None: return None Read More …

Using the Quantcast File System with Spark on the ODROID XU4 Cluster

July 31, 2016 michael

I used to work at a company named Quantcast, which is known for processing petabyte-scale data sets. When operating at that scale, any inefficiency in your data infrastructure gets multiplied many times over. Early on, Quantcast found that the inefficiencies in Hadoop literally cost them by requiring more hardware to obtain certain levels of performance. So Quantcast set out to write a highly efficient big data stack. At the bottom of that stack was Quantcast’s distributed file system called QFS. Quantcast open-sourced QFS, and performed several studies to show the efficiency gains QFS had over HDFS. Given the first hand experience I have had with QFS performance, I thought it would interesting to get it running on the ODROID XU4 cluster in combination with Spark to see how much QFS would improve processing speeds in our set up. Spoiler alert: there was no time improvements, but QFS required less memory than HDFS, allowing me to assign more RAM to Spark. Installing Read More …

Installing Spark onto the ODROID XU4 Cluster

July 24, 2016 michael

While installing and using Apache Hadoop on the ODROID XU4 cluster is cool, my ultimate goal was to get the more modern Apache Spark application for data analysis. In this post, we will not only install Spark, but also install the Jupyter Notebook for interacting with Spark conveniently from your computer’s browser. The installation instruction here will install Saprk and run it as a stand alone cluster in combination with the HDFS service we have previously installed. Installing Spark ends up being much simpler than installing Hadoop, but I found it to be more temperamental than Hadoop. Most notably, Spark is sensitive to the different CPU core speeds of the heterogenous octal-core CPU that the XU4 uses. If the Spark processes would run on the slower cores of the CPU, the master node would think the slave processes are timing out (missing a heartbeat), in turn restarting the executors on the slave nodes. This causes very inefficient repetition of work Read More …

What Every Data Scientist Should Know About Floating-Point Arithmetic

July 8, 2016 michael

In 1991 David Goldberg at Xerox PARC published the seminal paper on floating point arithmetic titled “What every computer scientist should know about floating-point arithmetic.” This paper was especially influential in the 1990’s and early 2000’s when limitations in computer hardware drove people to operate in a regime that most exposes them to the limitations of floating point arithmetic, specifically using 32 bit floats for storing and calculating floating point numbers. With modern 64 bit architecture and low storage and RAM costs, the incentive to use 32 bit floats is no longer there and computer scientist generally use 64 bit double precision floats, which greatly reduces them to the issues of floating point arithmetic. However, 64 bit floats do not remove the issues, it just proverbially kicks the can down the road waiting for usage patterns to evolve to once again cause floating point issues to become prominent again. With Big Data and extremely large data sets, both data scientists and computer scientists need to Read More …

Running the Word Count Job with Hadoop

July 2, 2016 michael

Now that we have Hadoop up and running on the ODROID XU4 cluster, let’s take it for a spin. Every technical platform has a “Hello World” type project, and for Hadoop and other map-reduce platforms, it is Word Count. In fact, the Apache Hadoop MapReduce Tutorial uses Word Count to demonstrate how to write a Hadoop job. We are going to use that tutorial’s code. However, at the time of this writing the WordCount example on the Apache Hadoop site has a bug in it. I’ve corrected that bug and posted the updated code to my Github repository for the ODROID XU4 cluster project. Log into the master node as hduser. We need to update the environment variables so that you can easily build Hadoop jobs. Do this by editing the .bashrc file and add the following at the end: export JAVA_HOME=$(readlink -f /usr/bin/java | sed “s:bin/java::”).. export PATH=$PATH:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:${JAVA_HOME}/bin export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar Then grab the corrected Word Count job code from the Read More …

Mounting HDFS with NFS

June 26, 2016 michael

After setting up the Hadoop installation on the ODROID XU4 cluster, we need to find a way to get data in and out of it. The traditional pattern used when a cluster is on it’s own network such as ours is is to have an edge node where the user logs into, transfers the data to that edge node, then put that data in HDFS from the edge node. Speaking from experience, this is annoyingly to much work. For my personal cluster, I want the HDFS file system to integrate with my Mac laptop. The most robust way to accomplish my goal with HDFS is to have it mounted as a NFS drive. The Hadoop distribution we are using has a NFS server built in. This server is run on the master node, effectively acting as a proxy between the HDFS cluster and the external network. The pros to this approach is that I get the usage paradigm that I want. Read More …

Installing Hadoop onto an ODROID XU4 Cluster

June 25, 2016 michael

Now is the time when we start to see the fruits of our labor in getting the ODROID XU4 low cost cluster built. We will be installing Hadoop and configuring it to serve an NFS mount that can be mounted on your client computer (e.g., your laptop) to be able to interact with the HDFS file system as if it were another hard drive on your computer. This feature will greatly ease the use of our cluster, as it will minimize the need for a user to log into the cluster to use it. An NFS mount is not the only necessary facet of the cluster to enable the client usage vision, but it is an important one. Before we install Hadoop, let’s discuss what we are trying to accomplish by installing it. Hadoop has three components: the Hadoop File System (HDFS), Yarn, and Map-Reduce. For our purposes, we are most interested in HDFS, but we will play around with the Read More …

Adding the MicroSD Data Drives to the ODROID XU4 Cluster

June 20, 2016 michael

We are nearly done with the basic set up of the ODROID XU4 four node cluster. The final step is to configure the MicroSD cards that were in the bill of materials as data drives for each node. Before we start, ensure that the cluster is completely powered down and remove the power cords from each node. We will be removing and adding components to the ODROID XU4’s, and it is just best practice to have no power to the device when removing or adding components. The first step is to format each MicroSD card with the ext4 filesystem. Each platform has it’s own instructions for how to do this. I will provide the Mac OS X instructions since that is what I use. In the terminal, do the following: If you don’t have it already, install the Homebrew package management app for OS X. Visit the brew.sh website for instructions. Install the e2fsprogs utility using brew install e2fsprogs With the MicroSD card attached to your computer, Read More …

Configuring DHCP and NAT in ODROID XU4 Cluster

June 20, 2016 michael

UPDATE – I have rebuilt this cluster to use Ubuntu 16.04. You can find updated instruction for Ubuntu 16.04 here. As was discussed in the network design post, we will set up the master node as a router to manage network traffic in and out of the cluster. Before starting, ensure that all of the slave nodes have been powered down, that your home network is still connected directly to the open port on the cluster’s ethernet switch, that you have collected each node’s MAC address, and that the master node is powered up and you are logged into it via SSH. The first step is to explicitly set up the networking interfaces for both the eth0 and eth1 device on the master node. Note that by default the Ubuntu system that was installed on the master node treats eth0 a as requesting a DHCP lease on the network it is attached to. This is why it got an IP Read More …