# Spark: queries with datetime and time zones

**Date:** 2021-05-05  
**Author:** Kees C. Bakker  
**Categories:** Databricks / Spark  
**Original:** https://keestalkstech.com/spark-queries-with-datetime-and-time-zones/

![Spark: queries with datetime and time zones](https://keestalkstech.com/wp-content/uploads/2021/05/juliana-kozoski-IoQioGLrz3Y-unsplash.jpg)

---

I operate from the Netherlands and that makes my time zone Central European Summer Time (CEST). The data I handle is usually stored in UTC time. Whenever I need to crunch some data with Spark I struggle to do the right date conversion, especially around [summer or winter time](https://www.wikiwand.com/en/Summer_time_in_Europe) (do I need to add 1 or 2 hours?). In this blog, I'll show how to handle these time zones properly in PySpark.

## Using Time Zones

Python comes packed with time zones. Unfortunately `CEST` is not one of them (`CET` is), so I just initialize the time zone with `Europe/Amsterdam`.

```py
import pytz
from datetime import datetime

tz = pytz.timezone("Europe/Amsterdam")

a = tz.localize(datetime(2021, 3, 19, 10))
print(a) # 2021-03-19 10:00:00+01:00

b = tz.localize(datetime(2021, 5, 19, 10))
print(b) # 2021-05-19 10:00:00+02:00
```

`pytz` is the Python implementation of the [IANA time zone database](https://www.iana.org/time-zones) (also called Olson).

## Adding time

I usually work with a *start* and an *end* date that are relative to each other, we can use `timedelta` to do calculations with time.

```py
from datetime import timedelta

c = b + timedelta(hours=2)
print(c) # 2021-05-19 12:00:00+02:00
```

Notice how the time zone is preserved in the calculation.

## Start of today *with* a time zone

Now, let's use the time zone to construct a date for the beginning of the day:

```py
import pytz
from datetime import datetime, date

tz = pytz.timezone("Europe/Amsterdam")

start_of_today = tz.localize(datetime.combine(
  date.today(),
  datetime.min.time()
))

print("Start of today:", start_of_today)
# Start of today: 2021-07-09 00:00:00+02:00
```

## Spark it

Let's create some data with timestamps. For ease of use I'm using strings that will be cast to timestamps:

```py
from pyspark.sql import *

data_df = spark.createDataFrame([
  
  # March 19th
  Row(ts="2021-03-19T07:55:00.000+0000", path="/"),
  Row(ts="2021-03-19T08:55:00.000+0000", path="/winkelmand/"),
  Row(ts="2021-03-19T09:55:00.000+0000", path="/kassa/bedankt/bevestiging"),
  Row(ts="2021-03-19T10:55:00.000+0000", path="/shop/miljuschka-by-wehkamp/"),
  Row(ts="2021-03-19T11:55:00.000+0000", path="/sacha-plateau-sneakers-zwart-16509505/"),
  
  # May 3rd
  Row(ts="2021-05-03T07:51:00.000+0000", path="/zoeken/"),
  Row(ts="2021-05-03T08:51:00.000+0000", path="/kinder-mode/nike/"),
  Row(ts="2021-05-03T09:51:00.000+0000", path="/kassa/bedankt/bevestiging/"),
  Row(ts="2021-05-03T10:51:00.000+0000", path="/kleding-jassen/state-of-art/"),
  Row(ts="2021-05-03T11:51:00.000+0000", path="/life-of-napoleon-dvd-16038356/"),
  
]).withColumn("ts", col("ts").cast("timestamp"))
```

Now, let's query the data from March 19<sup>th</sup>:

```py
import pytz
from datetime import datetime, timedelta

tz = pytz.timezone("Europe/Amsterdam")

start = tz.localize(datetime(2021, 3, 19, 10))
end = start + timedelta(hours=2)

print("CEST start", start)
print("UTC  start", start.astimezone(pytz.utc), "\n")

data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False)
```

This results in:

```spark_output
CEST start 2021-03-19 10:00:00+01:00
UTC  start 2021-03-19 09:00:00+00:00 

+-------------------+----------------------------+
|ts                 |path                        |
+-------------------+----------------------------+
|2021-03-19 09:55:00|/kassa/bedankt/bevestiging  |
|2021-03-19 10:55:00|/shop/miljuschka-by-wehkamp/|
+-------------------+----------------------------+
```

March 19<sup>th</sup> is *not* on summer time, so the time zone is `+01:00`, which means an hour of difference with `UTC`. Let's do the same query with a timestamp in summer, in this case May 3<sup>rd</sup>:

```py
import pytz
from datetime import datetime, timedelta

tz = pytz.timezone("Europe/Amsterdam")

start = tz.localize(datetime(2021, 5, 3, 10))
end = start + timedelta(hours=2)

print("CEST start", start)
print("UTC  start", start.astimezone(pytz.utc), "\n")

data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False)
```

Which results in:

```spark_output
CEST start 2021-05-03 10:00:00+02:00
UTC  start 2021-05-03 08:00:00+00:00 

+-------------------+---------------------------+
|ts                 |path                       |
+-------------------+---------------------------+
|2021-05-03 08:51:00|/kinder-mode/nike/         |
|2021-05-03 09:51:00|/kassa/bedankt/bevestiging/|
+-------------------+---------------------------+
```

Now the time zone is `+02:00`, which is 2 hours of difference with `UTC`. The results start from `08:00`.

### Change time zone display

We can make it easier by changing the default time zone on Spark:

```py
spark.conf.set("spark.sql.session.timeZone", "Europe/Amsterdam")
```

When we now `display` (Databricks) or `show`, it will show the result in the Dutch time zone.

## Conclusion

Keep calm and use time zones, don't subtract hours manually. Use the `pytz` library to init your `datetime` objects with a time zone. Spark will take care of the rest. You might want to set the Spark config to your own time zone to make sure the output is easier to read.

![](https://keestalkstech.com/wp-content/uploads/2021/07/supervillain_plan1.png)
*XKCD 1883: Supervillain Plan*

## Further reading

The following reading might interest you as well:

## Improvements

2021-07-28: Added XKCD comic and Jira ticket link.
2021-07-09: Added the [Start of today *with* a time zone](#start-of-today-with-a-time-zone) section.
2021-05-05: Original article was written.
