Spark: queries with datetime and time zones

I operate from the Netherlands and that makes my time zone Central European Summer Time (CEST). The data I handle is usually stored in UTC time. Whenever I need to crunch some data with Spark I struggle to do the right date conversion, especially around summer or winter time (do I need to add 1 or 2 hours?). In this blog, I'll show how to handle these time zones properly in PySpark.

Using Time Zones

Python comes packed with time zones. Unfortunately CEST is not one of them (CET is), so I just initialize the time zone with Europe/Amsterdam.

import pytz
from datetime import datetime

tz = pytz.timezone("Europe/Amsterdam")

a = tz.localize(datetime(2021, 3, 19, 10))
print(a) # 2021-03-19 10:00:00+01:00

b = tz.localize(datetime(2021, 5, 19, 10))
print(b) # 2021-05-19 10:00:00+02:00

pytz is the Python implementation of the IANA time zone database (also called Olson).

Adding time

I usually work with a start and an end date that are relative to each other, we can use timedelta to do calculations with time.

from datetime import timedelta

c = b + timedelta(hours=2)
print(c) # 2021-05-19 12:00:00+02:00

Notice how the time zone is preserved in the calculation.

Start of today with a time zone

Now, let's use the time zone to construct a date for the beginning of the day:

import pytz
from datetime import datetime, date

tz = pytz.timezone("Europe/Amsterdam")

start_of_today = tz.localize(datetime.combine(
  date.today(),
  datetime.min.time()
))

print("Start of today:", start_of_today)
# Start of today: 2021-07-09 00:00:00+02:00

Spark it

Let's create some data with timestamps. For ease of use I'm using strings that will be cast to timestamps:

from pyspark.sql import *

data_df = spark.createDataFrame([
  
  # March 19th
  Row(ts="2021-03-19T07:55:00.000+0000", path="/"),
  Row(ts="2021-03-19T08:55:00.000+0000", path="/winkelmand/"),
  Row(ts="2021-03-19T09:55:00.000+0000", path="/kassa/bedankt/bevestiging"),
  Row(ts="2021-03-19T10:55:00.000+0000", path="/shop/miljuschka-by-wehkamp/"),
  Row(ts="2021-03-19T11:55:00.000+0000", path="/sacha-plateau-sneakers-zwart-16509505/"),
  
  # May 3rd
  Row(ts="2021-05-03T07:51:00.000+0000", path="/zoeken/"),
  Row(ts="2021-05-03T08:51:00.000+0000", path="/kinder-mode/nike/"),
  Row(ts="2021-05-03T09:51:00.000+0000", path="/kassa/bedankt/bevestiging/"),
  Row(ts="2021-05-03T10:51:00.000+0000", path="/kleding-jassen/state-of-art/"),
  Row(ts="2021-05-03T11:51:00.000+0000", path="/life-of-napoleon-dvd-16038356/"),
  
]).withColumn("ts", col("ts").cast("timestamp"))

Now, let's query the data from March 19th:

import pytz
from datetime import datetime, timedelta

tz = pytz.timezone("Europe/Amsterdam")

start = tz.localize(datetime(2021, 3, 19, 10))
end = start + timedelta(hours=2)

print("CEST start", start)
print("UTC  start", start.astimezone(pytz.utc), "\n")

data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False)

This results in:

CEST start 2021-03-19 10:00:00+01:00
UTC  start 2021-03-19 09:00:00+00:00 

+-------------------+----------------------------+
|ts                 |path                        |
+-------------------+----------------------------+
|2021-03-19 09:55:00|/kassa/bedankt/bevestiging  |
|2021-03-19 10:55:00|/shop/miljuschka-by-wehkamp/|
+-------------------+----------------------------+

March 19th is not on summer time, so the time zone is +01:00, which means an hour of difference with UTC. Let's do the same query with a timestamp in summer, in this case May 3rd:

import pytz
from datetime import datetime, timedelta

tz = pytz.timezone("Europe/Amsterdam")

start = tz.localize(datetime(2021, 5, 3, 10))
end = start + timedelta(hours=2)

print("CEST start", start)
print("UTC  start", start.astimezone(pytz.utc), "\n")

data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False)

Which results in:

CEST start 2021-05-03 10:00:00+02:00
UTC  start 2021-05-03 08:00:00+00:00 

+-------------------+---------------------------+
|ts                 |path                       |
+-------------------+---------------------------+
|2021-05-03 08:51:00|/kinder-mode/nike/         |
|2021-05-03 09:51:00|/kassa/bedankt/bevestiging/|
+-------------------+---------------------------+

Now the time zone is +02:00, which is 2 hours of difference with UTC. The results start from 08:00.

Change time zone display

We can make it easier by changing the default time zone on Spark:

spark.conf.set("spark.sql.session.timeZone", "Europe/Amsterdam")

When we now display (Databricks) or show, it will show the result in the Dutch time zone.

Conclusion

Keep calm and use time zones, don't subtract hours manually. Use the pytz library to init your datetime objects with a time zone. Spark will take care of the rest. You might want to set the Spark config to your own time zone to make sure the output is easier to read.

Further reading

The following reading might interest you as well:

Improvements

2021-07-28: Added XKCD comic and Jira ticket link.
2021-07-09: Added the Start of today with a time zone section.
2021-05-05: Original article was written.

expand_less