PySpark – DIY Big Data

Identifying Bot Commenters on Reddit using Benford’s Law

March 14, 2020March 15, 2020 michael

How can you identify non-human actors on social media? There are many ways to do this, and each method has its strengths and weaknesses. In this post, I discuss how to use Benford’s Law to identify non-human actors in user interaction logs. Application of Benford’s Law Benford’s Law is an observation that a collection of numbers that measure naturally occurring events of items tend to have a logarithm frequency distribution for the first digit of these numbers. The are several characteristics of a naturally occurring set of numbers that Benford’s Law takes advantage of: The order of magnitude of the number in the set varies uniformly The numbers vary with multiplicative fluctuations The distribution of numbers is scale invariant The exact distribution of first digits that Benford’s Law predicts is: This results in a distribution that looks like this: For a collection of numbers, if the frequency of the numbers’ first digits does not align well with the distribution shown Read More …

Upgrading the Compute Cluster to 2.5G Ethernet

March 7, 2020June 27, 2020 michael

I recently updated my Personal Compute Cluster to use faster ethernet interconnect for the cluster network. After putting together a PySpark Benchmark, I was intrigued to see how faster networking within the cluster would help. Note: All product links below are affiliate links. Upgrading the Cluster Networking Hardware To upgrade the networking for the EGLOBAL S200 computers that I use within my cluster, my only real option was to use USB ethernet dongles. This is because the S200 computer has no PCIe expansion slots, but it does have a USB 3 bus. This gives me a few options. There are no 10 Gbps ethernet dongles for USB 3, but there are several 5 and 2.5 Gbps ethernet dongles. These are part fo the more recent NBASE-T ethernet standard which allows faster than 1 Gbps ethernet over Cat 5e and Cat 6 cabling. The first option I investigated was the StarTech USB 3 to 5 Gbps Ethernet Adapter. I am going Read More …

Benchmarking Software for PySpark on Apache Spark Clusters

January 24, 2020January 17, 2021 michael

Now that my Personal Compute Cluster is uninhibited by CPU overheating, I want to turn my configuration to work as efficiently as possible for the type of workloads I place on it. I searched around for Apache Spark benchmarking software, however most of what I found was either too older (circa Spark 1.x) or too arcane. I was able to get Ewan Higgs’s implementation of TeraSort working on my cluster, but it was written in Scala and not necessarily representative of the type of operations I would use in PySpark. So I set out to write my own. The primary goal of my benchmarking approach is to have a standard set of data and operations that I can compare the performance of before and after some change I make to my Spark deployment and be confident that any change in performance was due to the change in the Spark deployment and not due to variability in the benchmark. This attribute Read More …

Airline Flight Data Analysis – Part 1 – Data Preparation (Reprise)

December 1, 2019December 1, 2019 michael

As some of you know, I previously explored building a Spark cluster using ODROID XU-4 single board computers. I was able to demonstrate some utility with this cluster, but it was limited. One analysis I attempted looking at was the Airline On-Time Performance data set collected by the United States Department of Transportation. The XU-4 cluster allowed me to make summarization graphs of the data, but not much more. The primary constraint was the RAM pool the cluster had, which was 8 GB total RAM across the four nodes, and the 10 year data set was greater than 30 GB uncompressed. Now that I have built the Personal Compute Cluster, I decided to revise this data set to see if I could do more sophisticated analysis. The short answer is I can. But before I do that we need to prepare the raw data that we download from the Department of Transportation’s website into a format. Specifically we need to Read More …

Using Custom Hive UDFs With PySpark

October 30, 2016December 1, 2019 michael

Using Python to develop on Apache Spark is easy and familiar for many developers. However, due to the fact that Spark runs in a JVM, when your Python code interacts with the underlying Spark system, there can be an expensive process of data serialization and deserialization between the JVM and the Python interpreter. If you do most of your data manipulation using data frames in PySpark, you generally avoid this serialization cost because the Python code ends up being more of a high-level coordinator of the data frame operations rather than doing low-level operations on the data itself. This changes if you ever write a UDF in Python. To avoid the JVM-to-Python data serialization costs, you can use a Hive UDF written in Java. Creating a Hive UDF and then using it within PySpark can be a bit circuitous, but it does speed up your PySpark data frame flows if they are using Python UDFs. To illustrate this, I will rework the Read More …

DIY Big Data

One byte at a time …

Tag: PySpark

Identifying Bot Commenters on Reddit using Benford’s Law

Upgrading the Compute Cluster to 2.5G Ethernet

Benchmarking Software for PySpark on Apache Spark Clusters

Airline Flight Data Analysis – Part 1 – Data Preparation (Reprise)

Using Custom Hive UDFs With PySpark