Apache Spark provides a feature to infer the schema of incoming data. However, in Spark version 3.1.2, it wrongly infers a Timestamp field as the String data type when reading JSON.
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()
line = '{"myTimestamp" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)
Result:
DataFrame[myTimestamp: string]
Solution
After some googling, I found out that there is an option, inferTimestamp, that has to be enabled for Spark to recognize timestamp data. This option is disabled by default in Spark 3.0.1 and above.
line = '{"myTimestamp" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("inferTimestamp", "true")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)
Result:
DataFrame[myTimestamp: timestamp]
Note: inferTimestamp was disabled intentionally due to a performance issue. It is recommended to provide a schema and avoid schema inference on the fly when ingesting data.
Reference
- Migration Guide: SQL, Datasets and DataFrame - Documentation
- Interpret timestamp fields in Spark while reading json (timestampFormat) - SPARK-26325
- Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 - SPARK-32130