Using Custom Hive UDFs With PySpark

Using Python to develop on Apache Spark is easy and familiar for many developers. However, due to the fact that Spark runs in a JVM, when your Python code interacts with the underlying Spark system, there can be an expensive process of data serialization and deserialization between the JVM and the Python interpreter. If you do most of your data manipulation using data frames in PySpark, you generally avoid this serialization cost because the Python code ends up being more of a high-level coordinator of the data frame operations rather than doing low-level operations on the data itself. This changes if you ever write a UDF in Python. To avoid the JVM-to-Python data serialization costs, you can use a Hive UDF written in Java. Creating a Hive UDF and then using it within PySpark can be a bit circuitous, but it does speed up your PySpark data frame flows if they are using Python UDFs. To illustrate this, I will rework the Read More …