This week we’ve been working on processing the access logs from Cloudflare with Databricks (Spark). We now have a job that generates a huge CSV file (+1GB) and sends it on towards by FTP for further processing with an external tool. Creating a DataFrame with the right data was easy. Now, let’s explore how to do a CSV export, secrets management and an FTP transfer!
I operate from the Netherlands and that makes my time zone Central European Summer Time (CEST). The data I handle is usually stored in UTC time. Whenever I need to crunch some data with Spark I struggle to do the right date conversion, especially around summer or winter time (do I need to add 1 or 2 hours?). In this blog, I’ll show how to handle these time zones properly in PySpark.
To make a setup more resilient we could allow for actions to be retried when they fail. We should not “hammer” our underlaying systems, so it is wise to wait a bit before retrying (exponential backoff). Let’s see how something like this could be done in Python. Note: this only works if actions are idempotent and you can afford to wait.
At Wehkamp we use Redis a lot. It is fast, available and implemented as a managed AWS service called ElastiCache. Sometimes we need to extract data from Redis, and usually I use the redis-cli to interact from the command-line. But what if you need to get the values of 400k+ keys? What would you do? Is there an effective way to query multiple key/values from Redis?
I imagine your first thought is: why? Well, at Wehkamp we do a lot of cross platform development, but sometimes we end up with shell scripts that do stuff with Docker and Python. Usually that’s not a problem for Mac, but for Windows it’s a different thing. I have a MacBook Pro, but I’m a .NET developer, that’s why I prefer Windows, so I run Bootcamp. This article will show how to do Python development in the Windows Subsystem for Linux (WSL) using Visual Studio Code and Docker.