While working in Databricks, I needed to plot some images. I wrote some code that does this in IPython notebooks, but nothing that works on a Dataframe. I decided to change the code a bit, so it works in Databricks. This solution uses PIL and Matplotlib.
At Wehkamp we use decoration a lot. Decoration is a nice way of separating concerns from the actual code. Most of our repositories need the same set of decorators: exception logging, latency metrics and Jaeger spans. In this article I’ll be joining these 3 types of decorator into a single Swiss Army Knife decorator: one decorator to rule them all.
This week we’ve been working on processing the access logs from Cloudflare with Databricks (Spark). We now have a job that generates a huge CSV file (+1GB) and sends it on towards by FTP for further processing with an external tool. Creating a DataFrame with the right data was easy. Now, let’s explore how to do a CSV export, secrets management and an FTP transfer!
I operate from the Netherlands and that makes my time zone Central European Summer Time (CEST). The data I handle is usually stored in UTC time. Whenever I need to crunch some data with Spark I struggle to do the right date conversion, especially around summer or winter time (do I need to add 1 or 2 hours?). In this blog, I’ll show how to handle these time zones properly in PySpark.