I have always been interested in large-scale computing. This interest started back in graduate school when I was studying astrodynamics. Most of the problems you want to solve in astrodynamics require numerical computation, as only the 2-body problem has a closed-form solution. In order to solve anything more complex, you first have to make some simplifying assumptions (e.g., no other gravitational influences), and then you have a set of differential equations that describe the motion of your system. The only way to use these equations to predict the positions of the bodies is through numerical integration. In simple systems this sort of calculation is pretty straightforward, though you have to pay close attention to round-off error. However, if you are trying to predict the progression of a debris field in space, which would require you to simultaneously project the motion of tens of thousands of objects, parallelized computation starts to look attractive.
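As a rough illustration of what that kind of numerical integration looks like, here is a minimal Python sketch: a single object moving under simple point-mass gravity, stepped forward with a basic semi-implicit Euler integrator. The normalized units, initial conditions, and step size are illustrative choices for this sketch, not values from any real problem.

```python
# A minimal sketch of numerically integrating an orbit, since only the
# two-body problem has a closed-form solution. Uses a simple semi-implicit
# Euler step in normalized units (mu = 1); all values are illustrative.
import math

MU = 1.0  # gravitational parameter (GM) in normalized units

def acceleration(x, y):
    """Point-mass gravity: a = -mu * r / |r|^3."""
    r3 = (x * x + y * y) ** 1.5
    return -MU * x / r3, -MU * y / r3

def integrate(x, y, vx, vy, dt, steps):
    """Advance the state with semi-implicit Euler: update velocity, then position."""
    for _ in range(steps):
        ax, ay = acceleration(x, y)
        vx += ax * dt
        vy += ay * dt
        x += vx * dt
        y += vy * dt
    return x, y, vx, vy

# Start on a circular orbit of radius 1 and integrate one full period.
x, y, vx, vy = 1.0, 0.0, 0.0, 1.0
period = 2.0 * math.pi
dt = 1e-4
x, y, vx, vy = integrate(x, y, vx, vy, dt, int(period / dt))
print(f"after one period: x={x:.4f}, y={y:.4f}")  # should end up close to (1, 0)
```

Multiply that loop by tens of thousands of interacting objects and many time steps, and the appeal of spreading the work across many processors becomes obvious.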
Compute-centric parallelization is only one form of distributed computing. With the rise of the internet and the increased ease of generating and retaining data, a new class of distributed computing problems arose that focused on processing large data sets, especially ones that cannot fit on a single computer. Google introduced a paradigm for handling these big data problems with an approach called MapReduce, along with a way to distribute a file system across many machines. Open source solutions to the big data problem followed, such as Hadoop and, more recently, Spark. All of these solutions are focused on the idea of breaking up large data sets into smaller chunks, analyzing the chunks in parallel, and aggregating the results into a comprehensive answer.
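To show the pattern without tying it to any particular platform, here is a minimal Python sketch of that chunk-and-aggregate idea: split the data into chunks, count words in each chunk independently (the "map" step), then merge the partial counts into one answer (the "reduce" step). The sample data, chunk size, and word-count task are just illustrative choices.

```python
# A minimal sketch of the map-reduce idea: split a data set into chunks,
# process each chunk in parallel, then aggregate the partial results.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count word occurrences within one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def merge_counts(a, b):
    """Reduce step: combine two partial word counts into one."""
    a.update(b)
    return a

if __name__ == "__main__":
    # Stand-in for a "large" data set; real systems read this from a
    # distributed file system rather than an in-memory list.
    lines = ["spark makes big data simple",
             "big data needs big clusters",
             "spark runs on clusters"] * 1000

    # Break the data set into smaller chunks.
    chunk_size = 500
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    # Analyze the chunks in parallel, then aggregate into one answer.
    with Pool() as pool:
        partial_counts = pool.map(count_words, chunks)
    total = reduce(merge_counts, partial_counts, Counter())

    print(total.most_common(3))
```

Platforms like Hadoop and Spark take this same pattern and run it across many machines, with the file system and the scheduling of chunks handled for you.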
I have used big data platforms such as Hadoop, Spark, and QFS for the past 8 years of my career to solve many sorts of problems, from identifying fraudulent web activity, to analyzing the power of social networks in driving web traffic, to predicting the demographics of the audience that watches a specific YouTube video. In all of this, I have become reasonably adept at leveraging big data platforms to solve business problems, but I have only acquired what I would describe as a basic familiarity with how the underlying platforms work. So I decided to embark on a project of setting up my own cluster, in the hope that by setting up, configuring, and optimizing a distributed computing system, I will better understand the underlying technology that I have been using for so long.
One of the primary goals of my project is to use “real steel” when building my cluster. There are plenty of platforms where you can spin up a virtual cluster, such as AWS. But I am also interested in building my understanding of how hardware configuration impacts a cluster’s performance. So actual computers it is. The second goal is to focus on the Spark computing platform. I have used a number of platforms over the years, and I have found Spark to be the most elegant from a user perspective. It is also the focus of my current professional activity. So, I will set out to build a Spark cluster on real computers.
I plan to execute this project in at least two phases. The first is to use low-cost computer boards, such as the Raspberry Pi, to develop my initial understanding of how to set up and configure a cluster. Once I have built up my experience using low-cost solutions, I intend to build a “personal cluster” that should be affordable but also large enough to reasonably handle Spark-based data analysis work for data sets in the one to five terabyte range. Since I am doing this in my spare time, this project will likely take a while to execute. But do follow along; maybe we can learn something together.