Apache Spark provides a feature to infer the schema of incoming data. However, in Spark version 3.1.2, it wrongly infers a Timestamp field as the String data type when reading JSON.
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp").getOrCreate()
line = '{"myTimestamp" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)
Result:
DataFrame[myTimestamp: string]
Solution
After some googling, I found out that there is an option, inferTimestamp, that has to be enabled for Spark to recognize timestamp data. This option is disabled by default in Spark 3.0.1 and above.
line = '{"myTimestamp" : "2017-09-30 04:53:39.412496Z"}'
rdd = spark.sparkContext.parallelize([line])
(
    spark.read
    .option("inferTimestamp", "true")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS'Z'")
    .json(path=rdd)
)
Result:
DataFrame[myTimestamp: timestamp]
Note: inferTimestamp was disabled intentionally due to a performance issue. It is recommended to provide a schema and avoid schema inference on the fly when ingesting data.
Reference
- Migration Guide: SQL, Datasets and DataFrame - Documentation
- Interpret timestamp fields in Spark while reading json (timestampFormat) - SPARK-26325
- Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 - SPARK-32130