Benchmarking Software for PySpark on Apache Spark Clusters

Now that my Personal Compute Cluster is uninhibited by CPU overheating, I want to tune my configuration to work as efficiently as possible for the type of workloads I place on it. I searched around for Apache Spark benchmarking software, but most of what I found was either too old (circa Spark 1.x) or too arcane. I was able to get Ewan Higgs’s implementation of TeraSort working on my cluster, but it was written in Scala and not necessarily representative of the type of operations I would use in PySpark. So I set out to write my own. The primary goal of my benchmarking approach is to have a standard set of data and operations whose performance I can compare before and after a change to my Spark deployment, and to be confident that any difference in performance is due to that change and not to variability in the benchmark. This attribute Read More …
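To make the "change versus variability" idea concrete, here is a minimal sketch in plain Python. It is not the benchmark from the post: the function names, the default of five repeats, and the noise margin of 2.0 are all assumptions chosen for illustration. In practice the workload passed in would be a function that runs a PySpark job.

```python
import statistics
import time


def time_runs(workload, repeats=5):
    """Run `workload` `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (median, relative spread), where spread = (max - min) / median."""
    median = statistics.median(durations)
    spread = (max(durations) - min(durations)) / median
    return median, spread


def is_meaningful_change(before, after, noise_margin=2.0):
    """Hypothetical rule of thumb: call a change real only if the medians
    differ by more than `noise_margin` times the larger run-to-run spread."""
    med_before, spread_before = summarize(before)
    med_after, spread_after = summarize(after)
    relative_change = abs(med_after - med_before) / med_before
    return relative_change > noise_margin * max(spread_before, spread_after)
```

The point of the sketch is only that a benchmark needs a small run-to-run spread before a before/after comparison means anything. The threshold rule is one simple choice among many.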

ARM7 CPUs, Double Alignment, and Apache Spark

I haven’t posted an update to my data analysis projects in a while, partly because my day job has been a bit busy lately, and partly because what time I do have for my recreational coding has been taken up by a problem I was experiencing with Apache Spark. I started having stability problems on my ODROID XU4 cluster. I didn’t fully understand the cause at first, thinking for the longest time it was my own code. In the end, it proved to be a bug in Spark, or more specifically, an incompatibility between Spark’s memory management and the ARM7 platform of my ODROID XU4 cluster. The issue has to do with how some CPUs operate on double-precision floating-point values. These CPUs, including the 32-bit ARM7 CPU found in the ODROID XU4, require that when the CPU operates on a double floating-point value, the 8 bytes of memory used to contain the value should be aligned to 8 byte Read More …
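As a small illustration of the alignment requirement, the sketch below assumes the cut-off sentence refers to 8-byte boundaries, meaning the memory offset of a double must be a multiple of 8. The helper names are hypothetical and this is not code from Spark.

```python
DOUBLE_SIZE = 8  # bytes occupied by a double-precision floating-point value


def is_aligned(offset, alignment=DOUBLE_SIZE):
    """True if `offset` is a multiple of `alignment`."""
    return offset % alignment == 0


def align_up(offset, alignment=DOUBLE_SIZE):
    """Round `offset` up to the next multiple of `alignment`.

    Uses a bit mask, so `alignment` must be a power of two.
    """
    return (offset + alignment - 1) & ~(alignment - 1)
```

For example, a double stored at offset 12 is not aligned, and `align_up(12)` gives 16 as the next offset where it could legally sit on such a CPU.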

Using Custom Hive UDFs With PySpark

Using Python to develop on Apache Spark is easy and familiar for many developers. However, because Spark runs in a JVM, when your Python code interacts with the underlying Spark system, there can be an expensive process of data serialization and deserialization between the JVM and the Python interpreter. If you do most of your data manipulation using data frames in PySpark, you generally avoid this serialization cost because the Python code acts as a high-level coordinator of the data frame operations rather than doing low-level operations on the data itself. This changes if you ever write a UDF in Python. To avoid the JVM-to-Python data serialization costs, you can use a Hive UDF written in Java. Creating a Hive UDF and then using it within PySpark can be a bit circuitous, but it does speed up PySpark data frame flows that would otherwise use Python UDFs. To illustrate this, I will rework the Read More …