Spark Infer Timestamp data type from JSON

Posted by ChenRiang on August 29, 2021

Apache Spark can infer a data schema from incoming data. However, in Spark version 3.1.2, it wrongly infers Timestamp fields as String data type.

import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()

line = '{"myTimestamp" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)

Result:

DataFrame[myTimestamp: string]


Solution

After some googling, I found out that there is an option, inferTimestamp, that must be enabled for Spark to recognize timestamp data. This option is disabled by default in Spark 3.0.1 and above.

line = '{"myTimestamp" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("inferTimestamp", "true")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)

Result:

DataFrame[myTimestamp: timestamp]

** Note: inferTimestamp was disabled by default intentionally because of a performance issue (see SPARK-32130). It is recommended to provide a schema explicitly and avoid generating the schema on the fly when ingesting data.


Reference

  1. Migration Guide: SQL, Datasets and DataFrame - Documentation
  2. Interpret timestamp fields in Spark while reading json (timestampFormat) - SPARK-26325
  3. Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 - SPARK-32130