January 2020 – DIY Big Data

Benchmarking Software for PySpark on Apache Spark Clusters

January 24, 2020 · October 11, 2025 · michael

Now that my Personal Compute Cluster is uninhibited by CPU overheating, I want to tune my configuration to work as efficiently as possible for the type of workloads I place on it. I searched around for Apache Spark benchmarking software, but most of what I found was either too old (circa Spark 1.x) or too arcane. I was able to get Ewan Higgs’s implementation of TeraSort working on my cluster, but it was written in Scala and not necessarily representative of the type of operations I would use in PySpark. So I set out to write my own. The primary goal of my benchmarking approach is to have a standard set of data and operations whose performance I can compare before and after some change I make to my Spark deployment, and be confident that any change in performance was due to the change in the Spark deployment and not due to variability in the benchmark. This attribute Read More …

Improving the cooling of the EGLOBAL S200 computer

January 1, 2020 · October 11, 2025 · michael

When I set up my Personal Compute Cluster composed of EGLOBAL S200 computers (affiliate link), one of the first issues I noticed was that the CPUs would overheat when all 12 threads on the CPU (2 threads per core) were being run at 100% load. This disappointed me about the S200 computer, as it was otherwise perfect for my use case. So I set out to find a way to improve the situation. The first step is to be able to measure the CPU temperature by installing lm-sensors, a utility that allows you to read the CPU temperature. To install it on all machines in the cluster: Then you need to log into each machine in the cluster one by one and install the sensor software by running: The software will try to detect the various sensors that the computer has. Accept all the default options as the software presents them to you, even if there is a Read More …