Three years ago, I worked through a project of building a low-cost, low-power computer cluster, with the primary goal of becoming more familiar with the inner workings of Apache Spark. The project accomplished that goal, but since the cluster was made up of 32-bit ARM processors with only 2 GB of RAM each, it was not very useful for getting meaningful work done. What it did excel at, however, was teaching the user to write Spark code efficiently: if you wanted to get anything done on such a constrained system, you had to be mindful of inefficiencies.
Jump ahead to 2019, and I have decided to give it another go. This time, I want to make a cluster that is moderately useful for data analysis and machine learning, but does not break the bank either. The first step is to document my system design requirements and general goals:
- The cluster’s primary purpose will be to run Apache Spark to do data analysis. However, I would also like to experiment with ElasticSearch.
- The cluster should be usable for data sets of 256 GB or more in size, and have storage for a few terabytes of data.
- The cluster should use 64-bit x86 CPUs. My last cluster was based on 32-bit ARM CPUs. This created some compatibility issues that were interesting to sort through, but I don’t want to deal with that sort of problem this time.
- Building, configuring, and operating this cluster should facilitate the following learning objectives:
  - Understand the hardware tradeoffs for performance in distributed computing systems.
  - Advance my understanding and skill in Apache Spark.
  - Learn how to run Apache Spark in Kubernetes.
  - Experiment with HDFS 3.x and compare it to QFS 2.x.
- My target budget is $2,500, and I would prefer to come in under that but would go over if there is value in it.
- I want at least four physical nodes.
Probably the biggest driver in my hardware selection is my budget. I would love to get some of the new server-class motherboards. For example, SuperMicro’s new motherboard based on AMD’s EPYC 3251 CPU particularly caught my eye. But building out a cluster node that reasonably leverages the features of that motherboard resulted in a per-node cost of nearly $2,000. That would be my entire budget for a single node. Given my goal of having a distributed hardware environment, I would have to scale back my expectations of per-node capabilities.
Fortunately, in the three years since I built my original cluster, x86 single board computers have really taken off. There are a number of models, but two of them particularly stand out: the UDOO Bolt V8 and the ODROID H2. An honorable mention goes to the UP Xtreme board, but at the time of this writing it was still in Kickstarter mode and thus not available for consideration.
In order to make a selection, first we must compare the key attributes of each system:
| | ODROID H2 | UDOO Bolt V8 |
|---|---|---|
| CPU | Intel J4105 | AMD Ryzen™ V1605B |
| Cores & Threads | 4 cores, 4 threads | 4 cores, 8 threads |
| Memory | 0 GB installed, max 32 GB DDR4 | 0 GB installed, max 32 GB DDR4 |
| Storage | M.2 NVME, SATA 3 | 32 GB eMMC, M.2 NVME, SATA 3 |
| Ethernet LAN | 2x Gigabit | 1x Gigabit |
The first thing that stands out is the price difference between the ODROID H2 and the UDOO Bolt. That seems pretty significant at first, as I could have more nodes, and thus more computational power, for the same budget. But I also need to consider the number of threads and the CPU performance.
The UDOO Bolt has twice as many threads as the ODROID H2. This is significant when designing a Spark cluster. Spark will take advantage of data locality as much as possible. That is, Spark will try to run a task on a thread on the same machine where the data is located. This is more easily done if there are more threads on the node where the data is. For this consideration, the UDOO Bolt wins. You can read more about Spark’s approach to data locality here.
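As a side note, how long Spark will wait for a data-local slot before settling for a less-local one is tunable. A minimal sketch of the relevant settings in `spark-defaults.conf` (the values shown are Spark’s defaults, not tuned recommendations):

```properties
# How long Spark waits for a task slot at a given locality level
# before falling back to a less-local level (3s is the default)
spark.locality.wait        3s
spark.locality.wait.node   3s
spark.locality.wait.rack   3s
```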
On the other hand, the more RAM a node has relative to its thread count, the better Spark performs. This is because the more RAM a node has, the higher the chance that Spark can keep data at the highest level of data locality for a task: in RAM within the same JVM the task is running on. The UDOO Bolt has 4 GB of RAM per thread, while the ODROID H2 has 8 GB per thread. In this case, the ODROID H2 wins.
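The RAM-per-thread figures above are simple arithmetic, assuming both boards are maxed out at 32 GB:

```python
# RAM per thread for each board, assuming the full 32 GB DDR4 is installed
boards = {
    "UDOO Bolt V8": {"ram_gb": 32, "threads": 8},
    "ODROID H2": {"ram_gb": 32, "threads": 4},
}

for name, spec in boards.items():
    per_thread = spec["ram_gb"] / spec["threads"]
    print(f"{name}: {per_thread:.0f} GB RAM per thread")
```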
Another comparison point is the networking. The UDOO Bolt has one gigabit LAN port, while the ODROID H2 has two. This means that the ODROID H2 will have a higher overall data throughput than the UDOO Bolt. Higher networking bandwidth will benefit Spark during shuffles, as data is being moved between nodes, and when data is being initially loaded off of HDFS (or whatever file system is used). It should be noted that even though the ODROID H2 has 2 gigabits of overall bandwidth, any single data stream can only be transferred at a maximum of 1 gigabit, because a stream can only go through one LAN port. The multiple LAN ports only allow multiple streams to be transferred in parallel. Given that there are fewer threads per node on the ODROID H2, getting an amount of parallelism equivalent to what the UDOO Bolt can get with twice the threads would require an ODROID H2 cluster to have more nodes, which means the LAN will be used more to move data around. I would call it a weak win for the ODROID H2 at this point. But UDOO sells a LAN port expander for the Bolt, which would bring its total LAN ports to three. With this expansion, the UDOO Bolt is the winner for network bandwidth and the ability to best leverage it.
Finally, there are the CPU benchmarks. To evaluate this, I am using PassMark Software’s CPU Benchmark database. Using this as a reference, we see that the UDOO Bolt’s AMD Ryzen™ V1605B is over 2.8 times faster than the ODROID H2’s Intel J4105. In fact, if you look at this on a benchmark-score-per-thread basis (which isn’t the best thing to do, but bear with me), you see that the UDOO Bolt is at 931 and the ODROID H2 is at 658. Not only does the AMD Ryzen™ V1605B have more threads than the Intel J4105, but each thread is faster too. Furthermore, consider that in reality, not all the threads on each node are going to be dedicated to Spark. At least one will be dedicated to the HDFS filesystem, and maybe another to other system services. Given that, each UDOO Bolt node will have six threads available to Spark while the ODROID H2 will only have two. Combining that with each thread’s average benchmark, I would have to get at least 4 ODROID H2 nodes to have the same Spark computational throughput as a single UDOO Bolt. As a result, the ODROID H2 does not have a real cost advantage when comparing computing power. The UDOO Bolt is the clear winner when it comes to computational throughput, and as a result, ends up being the better value for the money too (at least in the context of my use case).
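The per-thread numbers work out as follows. The V1605B score (7454) is the PassMark figure quoted later in this post; the J4105 total is inferred from the per-thread figure above (658 × 4 threads):

```python
# Rough overall and per-thread comparison of the two CPUs' PassMark scores
v1605b = {"score": 7454, "threads": 8}   # AMD Ryzen V1605B (UDOO Bolt)
j4105 = {"score": 658 * 4, "threads": 4} # Intel J4105 (ODROID H2), inferred

ratio = v1605b["score"] / j4105["score"]
print(f"V1605B is {ratio:.1f}x the J4105 overall")

for name, cpu in [("V1605B", v1605b), ("J4105", j4105)]:
    print(f"{name}: {cpu['score'] // cpu['threads']} per thread")
```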
After considering all this, the hardware choice for me will be the UDOO Bolt. This is not to say that the ODROID H2 isn’t an interesting machine. I could build a less capable four node cluster with it that is much cheaper than the UDOO Bolt. In fact, I am likely to buy one or two of the H2s for other use cases and because I am a fan of the ODROID line of products. But for this project’s goals, the UDOO Bolt is the best choice.
One interesting comparison is between the UDOO Bolt’s CPU, the CPU in my current (old) MacBook Pro, and the one in the currently selling MacBook Pro model. The CPU benchmark for the Intel Core i7 in my 2014 MacBook Pro is 9355, and for the 15″ 2019 MacBook Pro it is 13765. The UDOO Bolt at 7454 compares well, and when I consider the fact that I will be making a cluster of four UDOO Bolts, I would expect the data processing power of this cluster to be at least three times better than my current laptop’s, and two times better than the current MacBook Pro model’s.
Another comparison is to the cluster I had built three years ago with ODROID XU-4 nodes. Unfortunately, I could not find a benchmark comparison for the ODROID XU-4’s Samsung Exynos 5422 CPU. However, I will note that the Samsung Exynos 5422 is a 32 bit CPU, and that I found in practice that since the XU-4 was so memory constrained, I could only get 1 thread on the Samsung Exynos 5422 to work reliably with Spark. I will eventually do a direct comparison of my XU-4 cluster versus the Bolt cluster, so stay tuned.
Cluster Node Design
The UDOO Bolt does not come with everything I need to get it up and running. Given my goals stated above, I determined that I would configure each node as follows:
| Component | Source | Price |
|---|---|---|
| UDOO Bolt V8 | Mouser | $418.00 |
| Crucial 32 GB DDR4 RAM Kit | Amazon | $133.95 |
| Intel 660p 1 TB NVMe SSD | Amazon | $99.69 |
| 19.5V 65W Power Adapter, 4.5mm x 3.0mm jack | Amazon | $11.98 |
The reason I am maxing out the UDOO Bolt’s RAM capacity is that Apache Spark performs best with large amounts of RAM. If I build a cluster of four nodes, the cluster will have a pool of 128 GB of RAM. This should be sufficient to analyze 256 GB of data at once: most of the time, due to initial filtering, the amount of data held in RAM is smaller than the original set on disk. However, if Spark does need to hold on to more than 128 GB of data, it can still spill some of that data to disk, at which point a high-performance SSD will help. The Intel 660p can read and write data at well over 1.5 GB per second on the UDOO Bolt, which is faster than a SATA III SSD. There are faster NVMe SSDs available, such as the Samsung EVO 970, but their prices are higher too. I found the Intel 660p to be a good balance of price, capacity, and performance.
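To make the 128 GB figure concrete, here is a back-of-the-envelope sketch of the cluster’s memory pool. The 2 GB per node reserved for the OS, HDFS, and other services is an illustrative assumption, not a measurement:

```python
# Rough memory-pool sizing for a 4-node cluster with 32 GB per node
nodes = 4
ram_per_node_gb = 32
reserved_per_node_gb = 2  # OS, HDFS datanode, etc. (assumed for illustration)

total_ram_gb = nodes * ram_per_node_gb
spark_pool_gb = nodes * (ram_per_node_gb - reserved_per_node_gb)
print(f"Total cluster RAM:      {total_ram_gb} GB")
print(f"RAM available to Spark: {spark_pool_gb} GB")
```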
This brings the total cost of each node to $663.62. You will note that I did not list the ethernet port expander for each node. I plan to create the basic cluster first, then maybe later add the port expanders to demonstrate the impact of networking on Spark’s performance.
The cluster I am going to build is going to follow the same basic network design as my last one, which I explained here. Applying this basic design to a cluster of four UDOO Bolt nodes, I would need to additionally purchase:
| Item | Source | Price |
|---|---|---|
| 4x UDOO Bolt Nodes | See Above | $2654.48 |
| 5 port gigabit switch | Amazon | $14.99 |
| 5x 1 ft Cat 5e cables | Amazon | $6.29 |
| USB 3 Type C Ethernet Adapter | Amazon | $15.99 |
| 10x M3 50mm male/female PCB Spacer | eBay | $4.46 |
| 10x M3 24mm male/female PCB spacer | eBay | $2.75 |
| 25x M3 locking nut | eBay | $0.99 |
This brings the total cluster cost to $2699.95, about $200 over my target. Given the impressive specifications of the UDOO Bolt, I am OK with that. There is one more thing you might want to buy if you don’t have one already: a power strip to plug all the power adapters into.
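The bill of materials is easy to sanity-check against the $2,500 target:

```python
# Sanity-check the per-node and total cluster costs from the tables above
node_parts = [418.00, 133.95, 99.69, 11.98]  # Bolt, RAM, SSD, power adapter
node_cost = sum(node_parts)

cluster_extras = [14.99, 6.29, 15.99, 4.46, 2.75, 0.99]  # switch, cables, etc.
cluster_cost = 4 * node_cost + sum(cluster_extras)

print(f"Per node: ${node_cost:.2f}")
print(f"Cluster:  ${cluster_cost:.2f}")
print(f"Over budget by ${cluster_cost - 2500:.2f}")
```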
At the time of this post’s writing, my UDOO Bolts are on pre-order. I should get them near the end of July 2019. Until then, I will research how to configure each node to implement a Kubernetes-managed cluster.