ARM7 CPUs, Double Alignment, and Apache Spark

I haven’t posted an update to my data analysis projects in a while. Partly because my day job has been a bit busy lately, and partly because what time I do have for my recreational coding has been taken up by a problem I was experiencing with Apache Spark. I started have stability problems on my ODROID XU4 cluster. I didn’t fully understand the cause at first, thinking for the longest time it was my own code. In the end, it proved to be a bug in spark, or more specifically, an incompatibility between Spark’s memory management and the ARM71 platform of my ODROID XU4 cluster. The issue has to do with how some CPUs operate on double floating point values. These CPUs, including the 32-bit ARM71 CPU found in the ODROID XU4, requires that when the CPU operates on a double floating point value the 8 bytes of memory used to contain the value should be aligned to 8 byte Read More …

Using Custom Hive UDFs With PySpark

Using Python to develop on Apache Spark is easy and familiar for many developers. However, due to the fact that Spark runs in a JVM, when your Python code interacts with the underlying Spark system, there can be an expensive process of data serialization and deserialization between the JVM and the Python interpreter. If you do most of your data manipulation using data frames in PySpark, you generally avoid this serialization cost because the Python code ends up being more of a high-level coordinator of the data frame operations rather than doing low-level operations on the data itself. This changes if you ever write a UDF in Python. To avoid the JVM-to-Python data serialization costs, you can use a Hive UDF written in Java. Creating a Hive UDF and then using it within PySpark can be a bit circuitous, but it does speed up your PySpark data frame flows if they are using Python UDFs. To illustrate this, I will rework the Read More …

Airline Flight Data Analysis – Part 2 – Analyzing On-Time Performance

In my last post on this topic, we loaded the Airline On-Time Performance data set collected by the United States Department of Transportation into a Parquet file to greatly improve the speed at which the data can be analyzed. Now, let’s take a first look at the data by graphing the average airline-caused flight delay by airline. This is a rather straightforward analysis, but is a good one to get started with the data set. Open a Jupyter python notebook on the cluster in the first cell indicate that we will be using MatPlotLib to do graphing: %matplotlib inline Then, in the next cell load data frames for the airline on time activity and airline meta data based on the parquet files built in the last post. Note that I am using QFS as my distributed file system. If you are using HDFS, simply update the file URLs as needed. air_data = spark.read.parquet(‘qfs://master:20000/user/michael/data/airline_data’) airlines = spark.read.parquet(‘qfs://master:20000/user/michael/data/airline_id_table’) Now we are ready to process Read More …

Airline Flight Data Analysis – Part 1 – Data Preparation

This data analysis project is to explore what insights can be derived from the Airline On-Time Performance data set collected by the United States Department of Transportation. The data can be downloaded in month chunks from the Bureau of Transportation Statistics website. The data gets downloaded as a raw CSV file, which is something that Spark can easily load. However, if you download 10+ years of data from the Bureau of Transportation Statistics (meaning you downloaded 120+ one month CSV files from the site), that would collectively represent 30+ GB of data. For commercial scale Spark clusters, 30 GB of text data is a trivial task. However, if you are running Spark on the ODROID XU4 cluster or in local mode on your Mac laptop, 30+ GB of text data is substantial. So, before we can do any analysis of the dataset, we need to transform it into a format that will allow us to quickly and efficiently interact with it. Fortunately, Read More …

Coverting Apache Access Logs to Parquet Backed Data Frames on Spark

One of the analysis I am looking to do with my ODROID XU4 cluster is to take a look at various access patterns on my website by analyzing the Apache http access logs. Analyzing Apache access logs directly in Spark can be slow due to them being unstructured text logs. Converting to the logs to a data frame backed by partitioned parquet files can make subsequent analysis much faster. The first task is to create a mapper that can be used in Spark convert a row int eh access log to a Spark Row object. A Python 3 mapper would look like: # Parse an Apache access log. Assumes Python 3 import re from pyspark.sql import Row from datetime import datetime APACHE_ACCESS_LOG_PATTERN = ‘^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] “(\S+) (\S+) (\S+)” (\d{3}) (\d+) “((?:[^”]|”)+)” “((?:[^”]|”)+)”$’ DATETIME_PARSE_PATTERN = ‘%d/%b/%Y:%H:%M:%S %z’ # Returns a Row containing the Apache Access Log info def parse_apache_log_line(logline): match = re.search(APACHE_ACCESS_LOG_PATTERN, logline) if match is None: return None Read More …

What Every Data Scientist Should Know About Floating-Point Arithmetic

In 1991 David Goldberg at Xerox PARC published the seminal paper on floating point arithmetic titled “What every computer scientist should know about floating-point arithmetic.” This paper was especially influential in the 1990’s and early 2000’s when limitations in computer hardware drove people to operate in a regime that most exposes them to the limitations of floating point arithmetic, specifically using 32 bit floats for storing and calculating floating point numbers. With modern 64 bit architecture and low storage and RAM costs, the incentive to use 32 bit floats is no longer there and computer scientist generally use 64 bit double precision floats, which greatly reduces them to the issues of floating point arithmetic. However, 64 bit floats do not remove the issues, it just proverbially kicks the can down the road waiting for usage patterns to evolve to once again cause floating point issues to become prominent again. With Big Data and extremely large data sets, both data scientists and computer scientists need to Read More …

DIY Big Data Project Goal

I have always been interested in large scale computing. This interest started back in graduate school when I was studying astrodynamics. Most of the problems you want to solve in astrodynamics requires numerical computation, as only the 2-body problem has a closed form solution. In order to solve anything more complex, you first have to make some simplifying assumptions (e.g., no other gravitational influences), and then you have a set of partial differential equations that will describe the motion of your system. The only way to use these equations to predict the position of the bodies was through numerical integration. In simple systems this sort of calculations was pretty straightforward, though you have to pay a lot of attention to round off error. However, when if you were trying to predict the progression of a debris field in space, which would require you to simultaneously project the motion of tens of thousands of objects, parallelized computation start to look attractive. Compute Read More …