How to fix null values in Spark DataFrame columns
No telling if I’ll ever need this again, but this weekend I was helping someone with some Scala Spark work, and the short version of the story is that they were ending up with null values in their data after performing a Spark join. The null values were ending up in two fields, one named balance and the other named accountId, so I created these two Spark udf functions to fix the data, converting null values to 0 (a Long) in the first example, and null values to empty strings in the second example:
// replace a null balance with 0L; otherwise parse the string as a Long
val fixBalance = udf((s: String) => if (s == null) 0L else s.toLong)
val df2: DataFrame = df.withColumn("balance", fixBalance($"balance"))

// replace a null accountId with an empty string
val fixAccountId = udf((s: String) => if (s == null) "" else s)
val df3: DataFrame = df2.withColumn("accountId", fixAccountId($"accountId"))
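As a side note, if your columns already have the right types, Spark’s built-in DataFrame.na.fill methods can replace null values without writing a udf. In our case the balance column arrived as a string that also had to be parsed into a Long, which is why I used a udf, but here’s a sketch of the built-in approach, assuming balance is already a numeric column:

```scala
// a sketch using Spark's built-in null-replacement API (no udf needed);
// assumes balance is already a Long/numeric column and accountId is a string
val dfFilled = df
    .na.fill(0L, Seq("balance"))      // replace null balances with 0
    .na.fill("", Seq("accountId"))    // replace null accountIds with ""
```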
Notice that I started with a Spark DataFrame named df, then created df2, and then df3. So the final solution involved using df3, which had the corrected, non-null data thanks to the udf functions, like this:
val res: Dataset[CustomerAccounts] = df3.groupBy( ...
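The rest of that groupBy code is elided above, but to show the general shape of such a solution, here’s a hypothetical sketch; the CustomerAccounts fields and the sum aggregation are my assumptions for illustration, not the original code:

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._   // needed for $"..." syntax and .as[CustomerAccounts]

// hypothetical case class; the real fields aren't shown in the original code
case class CustomerAccounts(accountId: String, balance: Long)

// group the cleaned-up data by account, summing the balances per account
val res: Dataset[CustomerAccounts] = df3
    .groupBy($"accountId")
    .agg(sum($"balance").as("balance"))
    .as[CustomerAccounts]
```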
In summary, if you ever have null values in Spark DataFrame columns, I hope these examples of how to fix those null values are helpful. There may be other ways to solve this problem, but this solution worked for what we were doing this weekend.
Reporting live from Boulder, Colorado,
Alvin Alexander