# Spark: queries with datetime and time zones **Date:** 2021-05-05 **Author:** Kees C. Bakker **Categories:** Databricks / Spark **Original:** https://keestalkstech.com/spark-queries-with-datetime-and-time-zones/ ![Spark: queries with datetime and time zones](https://keestalkstech.com/wp-content/uploads/2021/05/juliana-kozoski-IoQioGLrz3Y-unsplash.jpg) --- I operate from the Netherlands and that makes my time zone Central European Summer Time (CEST). The data I handle is usually stored in UTC time. Whenever I need to crunch some data with Spark I struggle to do the right date conversion, especially around [summer or winter time](https://www.wikiwand.com/en/Summer_time_in_Europe) (do I need to add 1 or 2 hours?). In this blog, I'll show how to handle these time zones properly in PySpark. ## Using Time Zones Python comes packed with time zones. Unfortunately `CEST` is not one of them (`CET` is), so I just initialize the time zone with `Europe/Amsterdam`. ```py import pytz from datetime import datetime tz = pytz.timezone("Europe/Amsterdam") a = tz.localize(datetime(2021, 3, 19, 10)) print(a) # 2021-03-19 10:00:00+01:00 b = tz.localize(datetime(2021, 5, 19, 10)) print(b) # 2021-05-19 10:00:00+02:00 ``` `pytz` is the Python implementation of the [IANA time zone database](https://www.iana.org/time-zones) (also called Olson). ## Adding time I usually work with a *start* and an *end* date that are relative to each other, we can use `timedelta` to do calculations with time. ```py from datetime import timedelta c = b + timedelta(hours=2) print(c) # 2021-05-19 12:00:00+02:00 ``` Notice how the time zone is preserved in the calculation. ## Start of today *with* a time zone Now, let's use the time zone to construct a date for the beginning of the day: ```py import pytz from datetime import datetime, date tz = pytz.timezone("Europe/Amsterdam") start_of_today = tz.localize(datetime.combine( date.today(), datetime.min.time() )) print("Start of today:", start_of_today) # Start of today: 2021-07-09 00:00:00+02:00 ``` ## Spark it Let's create some data with timestamps. For ease of use I'm using strings that will be cast to timestamps: ```py from pyspark.sql import * data_df = spark.createDataFrame([ # March 19th Row(ts="2021-03-19T07:55:00.000+0000", path="/"), Row(ts="2021-03-19T08:55:00.000+0000", path="/winkelmand/"), Row(ts="2021-03-19T09:55:00.000+0000", path="/kassa/bedankt/bevestiging"), Row(ts="2021-03-19T10:55:00.000+0000", path="/shop/miljuschka-by-wehkamp/"), Row(ts="2021-03-19T11:55:00.000+0000", path="/sacha-plateau-sneakers-zwart-16509505/"), # May 3rd Row(ts="2021-05-03T07:51:00.000+0000", path="/zoeken/"), Row(ts="2021-05-03T08:51:00.000+0000", path="/kinder-mode/nike/"), Row(ts="2021-05-03T09:51:00.000+0000", path="/kassa/bedankt/bevestiging/"), Row(ts="2021-05-03T10:51:00.000+0000", path="/kleding-jassen/state-of-art/"), Row(ts="2021-05-03T11:51:00.000+0000", path="/life-of-napoleon-dvd-16038356/"), ]).withColumn("ts", col("ts").cast("timestamp")) ``` Now, let's query the data from March 19^th: ```py import pytz from datetime import datetime, timedelta tz = pytz.timezone("Europe/Amsterdam") start = tz.localize(datetime(2021, 3, 19, 10)) end = start + timedelta(hours=2) print("CEST start", start) print("UTC start", start.astimezone(pytz.utc), "\n") data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False) ``` This results in: ```spark_output CEST start 2021-03-19 10:00:00+01:00 UTC start 2021-03-19 09:00:00+00:00 +-------------------+----------------------------+ |ts |path | +-------------------+----------------------------+ |2021-03-19 09:55:00|/kassa/bedankt/bevestiging | |2021-03-19 10:55:00|/shop/miljuschka-by-wehkamp/| +-------------------+----------------------------+ ``` March 19^th is *not* on summer time, so the time zone is `+01:00`, which means an hour of difference with `UTC`. Let's do the same query with a timestamp in summer, in this case May 3^rd: ```py import pytz from datetime import datetime, timedelta tz = pytz.timezone("Europe/Amsterdam") start = tz.localize(datetime(2021, 5, 3, 10)) end = start + timedelta(hours=2) print("CEST start", start) print("UTC start", start.astimezone(pytz.utc), "\n") data_df.filter((col("ts") >= start) & (col("ts") < end)).show(20, False) ``` Which results in: ```spark_output CEST start 2021-05-03 10:00:00+02:00 UTC start 2021-05-03 08:00:00+00:00 +-------------------+---------------------------+ |ts |path | +-------------------+---------------------------+ |2021-05-03 08:51:00|/kinder-mode/nike/ | |2021-05-03 09:51:00|/kassa/bedankt/bevestiging/| +-------------------+---------------------------+ ``` Now the time zone is `+02:00`, which is 2 hours of difference with `UTC`. The results start from `08:00`. ### Change time zone display We can make it easier by changing the default time zone on Spark: ```py spark.conf.set("spark.sql.session.timeZone", "Europe/Amsterdam") ``` When we now `display` (Databricks) or `show`, it will show the result in the Dutch time zone. ## Conclusion Keep calm and use time zones, don't subtract hours manually. Use the `pytz` library to init your `datetime` objects with a time zone. Spark will take care of the rest. You might want to set the Spark config to your own time zone to make sure the output is easier to read. ![](https://keestalkstech.com/wp-content/uploads/2021/07/supervillain_plan1.png) *XKCD 1883: Supervillain Plan* ## Further reading The following reading might interest you as well: ## Improvements 2021-07-28: Added XKCD comic and Jira ticket link. 2021-07-09: Added the [Start of today *with* a time zone](#start-of-today-with-a-time-zone) section. 2021-05-05: Original article was written.