How to fix null values in Spark DataFrame columns

No telling if I’ll ever need this again, but this weekend I was helping someone with some Scala Spark work, and the short version of the story is that they were ending up with null values in their data after a Spark join. The null values were showing up in two fields, one named balance and the other named accountId, so I created these two Spark udf functions to fix the data, converting null values to 0 (a Long) in the first example, and null values to empty strings in the second example:

// these examples assume the usual imports:
// import org.apache.spark.sql.functions.udf
// import spark.implicits._   (for the $"columnName" syntax)
val fixBalance = udf((s: String) => if (s == null) 0L else s.toLong)
val df2: DataFrame = df.withColumn("balance", fixBalance($"balance"))

val fixAccountId = udf((s: String) => if (s == null) "" else s)
val df3: DataFrame = df2.withColumn("accountId", fixAccountId($"accountId"))
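For what it’s worth, the same fixes can also be sketched without udf functions, using Spark’s built-in coalesce and lit functions. This is just a sketch under the same assumptions as the udf version, i.e., that balance arrives as a string column holding numeric values:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}

// cast balance from string to long (a null stays null through the cast),
// then replace any remaining null with 0L; likewise replace a null
// accountId with an empty string
val fixed: DataFrame = df
    .withColumn("balance", coalesce(col("balance").cast("long"), lit(0L)))
    .withColumn("accountId", coalesce(col("accountId"), lit("")))
```

One small difference from the udf approach: s.toLong throws an exception on a non-numeric string, while cast("long") quietly turns it into a null (which coalesce then turns into 0L here).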

Notice that I started with a Spark DataFrame named df, then created df2 and then df3. So then the final solution involved using df3 — which had the corrected, non-null data, thanks to the udf functions — like this:

val res: Dataset[CustomerAccounts] = df3.groupBy( ...

In summary, if you ever have null values in Spark DataFrame columns, I hope these examples of how to fix those null values are helpful. There may be other ways to solve this problem, but this solution worked for what we were doing this weekend.

Reporting live from Boulder, Colorado,
Alvin Alexander