I operate from the Netherlands, which puts me on Central European Time (CET) in winter and Central European Summer Time (CEST) in summer. The data I handle is usually stored in UTC. Whenever I need to crunch some data with Spark, I struggle to do the right date conversion, especially around the switch between summer and winter time (do I need to add 1 or 2 hours?). In this blog, I'll show how to handle these time zones properly in PySpark.
Using Time Zones
The pytz library comes packed with time zones. Unfortunately, CEST is not one of them (CET is), so I just initialize the time zone with Europe/Amsterdam instead.
import pytz
from datetime import datetime

tz = pytz.timezone("Europe/Amsterdam")

a = tz.localize(datetime(2021, 3, 19, 10))
print(a) # 2021-03-19 10:00:00+01:00 (winter: CET, UTC+1)

b = tz.localize(datetime(2021, 5, 19, 10))
print(b) # 2021-05-19 10:00:00+02:00 (summer: CEST, UTC+2)
pytz is the Python implementation of the IANA time zone database (also called the Olson database).
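One pitfall worth knowing about pytz specifically: always attach the zone with localize(), never by passing it as the tzinfo argument of the datetime constructor. The constructor picks the zone's first historical offset (local mean time from before standard time zones existed), which is off by a fraction of an hour. A minimal sketch:

```python
import pytz
from datetime import datetime, timedelta

tz = pytz.timezone("Europe/Amsterdam")

# Wrong: the constructor grabs the zone's oldest (local mean time) offset,
# not CET/CEST:
wrong = datetime(2021, 3, 19, 10, tzinfo=tz)

# Right: localize() resolves the correct offset for that specific date:
right = tz.localize(datetime(2021, 3, 19, 10))
print(right) # 2021-03-19 10:00:00+01:00

print(wrong.utcoffset() == right.utcoffset()) # False
```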
Adding time
I usually work with a start and an end date that are relative to each other; we can use timedelta to do calculations with time.
from datetime import timedelta
c = b + timedelta(hours=2)
print(c) # 2021-05-19 12:00:00+02:00
Notice how the time zone is preserved in the calculation.
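One caveat, though: with pytz, arithmetic that crosses a DST switch keeps the old offset until you call normalize() on the result. A sketch of what that looks like:

```python
import pytz
from datetime import datetime, timedelta

tz = pytz.timezone("Europe/Amsterdam")

# In 2021, Dutch clocks moved forward on March 28th at 02:00.
before = tz.localize(datetime(2021, 3, 27, 23))  # still +01:00
after = before + timedelta(hours=6)

print(after)                # 2021-03-28 05:00:00+01:00 -- stale offset
print(tz.normalize(after))  # 2021-03-28 06:00:00+02:00 -- corrected
```

Both values represent the same instant; normalize() only repairs the displayed offset and wall clock.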
Start of today with a time zone
Now, let's use the time zone to construct a date for the beginning of the day:
import pytz
from datetime import datetime, date
tz = pytz.timezone("Europe/Amsterdam")
start_of_today = tz.localize(datetime.combine(
    date.today(),
    datetime.min.time()
))
print("Start of today:", start_of_today)
# Start of today: 2021-07-09 00:00:00+02:00
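Building on that, here is a small helper (my own sketch, not from any library) that returns a half-open [start, end) interval for any local calendar day. Note that a day crossing the spring DST switch is only 23 hours long:

```python
import pytz
from datetime import datetime, date, timedelta

tz = pytz.timezone("Europe/Amsterdam")

def local_day_bounds(day):
    """Half-open [start, end) interval covering one local calendar day."""
    start = tz.localize(datetime.combine(day, datetime.min.time()))
    end = tz.localize(datetime.combine(day + timedelta(days=1), datetime.min.time()))
    return start, end

start, end = local_day_bounds(date(2021, 3, 28))  # the 2021 spring DST switch
print(start, "->", end)
print(end - start) # 23:00:00 -- the clock jumped an hour forward
```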
Spark it
Let's create some data with timestamps. For ease of use I'm using strings that will be cast to timestamps:
from pyspark.sql import Row
from pyspark.sql.functions import col

data_df = spark.createDataFrame([
    # March 19th
    Row(ts="2021-03-19T07:55:00.000+0000", path="/"),
    Row(ts="2021-03-19T08:55:00.000+0000", path="/winkelmand/"),
    Row(ts="2021-03-19T09:55:00.000+0000", path="/kassa/bedankt/bevestiging"),
    Row(ts="2021-03-19T10:55:00.000+0000", path="/shop/miljuschka-by-wehkamp/"),
    Row(ts="2021-03-19T11:55:00.000+0000", path="/sacha-plateau-sneakers-zwart-16509505/"),
    # May 3rd
    Row(ts="2021-05-03T07:51:00.000+0000", path="/zoeken/"),
    Row(ts="2021-05-03T08:51:00.000+0000", path="/kinder-mode/nike/"),
    Row(ts="2021-05-03T09:51:00.000+0000", path="/kassa/bedankt/bevestiging/"),
    Row(ts="2021-05-03T10:51:00.000+0000", path="/kleding-jassen/state-of-art/"),
    Row(ts="2021-05-03T11:51:00.000+0000", path="/life-of-napoleon-dvd-16038356/"),
]).withColumn("ts", col("ts").cast("timestamp"))
Now, let's query the data from March 19th:
import pytz
from datetime import datetime, timedelta
tz = pytz.timezone("Europe/Amsterdam")
start = tz.localize(datetime(2021, 3, 19, 10))
end = start + timedelta(hours=2)
print("CEST start", start)
print("UTC start", start.astimezone(pytz.utc), "\n")
data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False)
This results in:
CEST start 2021-03-19 10:00:00+01:00
UTC start 2021-03-19 09:00:00+00:00
+-------------------+----------------------------+
|ts |path |
+-------------------+----------------------------+
|2021-03-19 09:55:00|/kassa/bedankt/bevestiging |
|2021-03-19 10:55:00|/shop/miljuschka-by-wehkamp/|
+-------------------+----------------------------+
March 19th is not on summer time, so the time zone offset is +01:00, one hour of difference with UTC. Let's do the same query with a timestamp in summer, in this case May 3rd:
import pytz
from datetime import datetime, timedelta
tz = pytz.timezone("Europe/Amsterdam")
start = tz.localize(datetime(2021, 5, 3, 10))
end = start + timedelta(hours=2)
print("CEST start", start)
print("UTC start", start.astimezone(pytz.utc), "\n")
data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False)
Which results in:
CEST start 2021-05-03 10:00:00+02:00
UTC start 2021-05-03 08:00:00+00:00
+-------------------+---------------------------+
|ts |path |
+-------------------+---------------------------+
|2021-05-03 08:51:00|/kinder-mode/nike/ |
|2021-05-03 09:51:00|/kassa/bedankt/bevestiging/|
+-------------------+---------------------------+
Now the time zone offset is +02:00, which is 2 hours of difference with UTC, and the results start from 08:00.
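This works because timezone-aware datetimes compare by the instant they represent, not by their wall-clock digits, and the filter bounds we hand to Spark are such instants. A quick pure-Python check:

```python
import pytz
from datetime import datetime

tz = pytz.timezone("Europe/Amsterdam")

local = tz.localize(datetime(2021, 5, 3, 10))     # 10:00 Dutch summer time
utc = pytz.utc.localize(datetime(2021, 5, 3, 8))  # 08:00 UTC

# Different wall clocks, same instant:
print(local == utc) # True
```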
Change time zone display
We can make it easier by changing the default time zone on Spark:
spark.conf.set("spark.sql.session.timeZone", "Europe/Amsterdam")
When we now call display (Databricks) or show, the result is shown in the Dutch time zone.
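The stored instants don't change, only how they are rendered. The plain-Python equivalent of that rendering step is astimezone(); a sketch using one of the sample timestamps from above:

```python
import pytz
from datetime import datetime

# One of the sample rows, as stored: 08:51 UTC
stored = pytz.utc.localize(datetime(2021, 5, 3, 8, 51))

# Rendered in the Dutch time zone:
shown = stored.astimezone(pytz.timezone("Europe/Amsterdam"))
print(shown) # 2021-05-03 10:51:00+02:00
```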
Conclusion
Keep calm and use time zones; don't subtract hours manually. Use the pytz library to initialize your datetime objects with a time zone, and Spark will take care of the rest. You might want to set the Spark config to your own time zone to make the output easier to read.
Further reading
The following reading might interest you as well:
- A Comprehensive Look at Dates and Timestamps in Apache Spark™ 3.0
- pytz - brings the Olson tz database into Python
- [SPARK-32123][Python] Setting `spark.sql.session.timeZone` only partially respected
Improvements
2021-07-28: Added XKCD comic and Jira ticket link.
2021-07-09: Added the Start of today with a time zone section.
2021-05-05: Original article was written.