Benchmark – DIY Big Data

Upgrading the Compute Cluster to 2.5G Ethernet

March 7, 2020June 27, 2020 michael

I recently updated my Personal Compute Cluster to use faster ethernet interconnect for the cluster network. After putting together a PySpark Benchmark, I was intrigued to see how faster networking within the cluster would help. Note: All product links below are affiliate links. Upgrading the Cluster Networking Hardware To upgrade the networking for the EGLOBAL S200 computers that I use within my cluster, my only real option was to use USB ethernet dongles. This is because the S200 computer has no PCIe expansion slots, but it does have a USB 3 bus. This gives me a few options. There are no 10 Gbps ethernet dongles for USB 3, but there are several 5 and 2.5 Gbps ethernet dongles. These are part fo the more recent NBASE-T ethernet standard which allows faster than 1 Gbps ethernet over Cat 5e and Cat 6 cabling. The first option I investigated was the StarTech USB 3 to 5 Gbps Ethernet Adapter. I am going Read More …

Benchmarking Software for PySpark on Apache Spark Clusters

January 24, 2020January 17, 2021 michael

Now that my Personal Compute Cluster is uninhibited by CPU overheating, I want to turn my configuration to work as efficiently as possible for the type of workloads I place on it. I searched around for Apache Spark benchmarking software, however most of what I found was either too older (circa Spark 1.x) or too arcane. I was able to get Ewan Higgs’s implementation of TeraSort working on my cluster, but it was written in Scala and not necessarily representative of the type of operations I would use in PySpark. So I set out to write my own. The primary goal of my benchmarking approach is to have a standard set of data and operations that I can compare the performance of before and after some change I make to my Spark deployment and be confident that any change in performance was due to the change in the Spark deployment and not due to variability in the benchmark. This attribute Read More …

One byte at a time …

Tag: Benchmark

Upgrading the Compute Cluster to 2.5G Ethernet

Benchmarking Software for PySpark on Apache Spark Clusters