Yesterday we encountered a curious problem: we needed to use Spark to parse JSON data that was produced by AWS Kinesis. Unfortunately, the data was in a concatenated JSON format. Spark does not support that format, so we had to come up with something ourselves.
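Concatenated JSON is a stream of JSON objects glued together without newlines or an enclosing array, so `spark.read.json` cannot split it on its own. Below is a minimal sketch of one possible workaround (not necessarily the solution from the post itself): read each file whole, cut it into separate documents with Python's `json.JSONDecoder`, and hand those to Spark. The S3 path is hypothetical.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def split_concatenated_json(blob):
    """Yield one JSON string per object in a concatenated JSON blob."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(blob):
        # raw_decode does not tolerate leading whitespace, so skip it first
        while pos < len(blob) and blob[pos].isspace():
            pos += 1
        if pos >= len(blob):
            break
        obj, pos = decoder.raw_decode(blob, pos)
        yield json.dumps(obj)


# wholeTextFiles yields (path, content) pairs; flatMap turns each blob
# into individual JSON documents that spark.read.json does understand.
records = (
    spark.sparkContext
    .wholeTextFiles("s3://my-bucket/kinesis-output/")  # hypothetical location
    .flatMap(lambda pair: split_concatenated_json(pair[1]))
)

df = spark.read.json(records)
df.show()
```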
This week we’ve been working on processing the access logs from Cloudflare with Databricks (Spark). We now have a job that generates a huge CSV file (1 GB+) and sends it on by FTP for further processing with an external tool. Creating a DataFrame with the right data was easy. Now, let’s explore how to do the CSV export, secrets management and the FTP transfer!
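As a rough sketch of those three steps on Databricks: write the DataFrame to a single CSV part file, read the FTP credentials from a secret scope with `dbutils.secrets`, and push the file with Python's `ftplib`. The paths, secret scope and key names, and FTP host below are all made up, and `df` stands for the DataFrame from the post.

```python
import ftplib
import glob

# `df` is the DataFrame with the Cloudflare access logs (assumed to exist).
# 1. Export it as a single CSV part file.
(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("dbfs:/tmp/cloudflare-export"))

# Spark writes a directory with one part file; pick it up via the /dbfs mount.
csv_path = glob.glob("/dbfs/tmp/cloudflare-export/part-*.csv")[0]

# 2. Read the FTP credentials from a Databricks secret scope
#    (`dbutils` is available in Databricks notebooks; scope/key names are hypothetical).
ftp_user = dbutils.secrets.get(scope="cloudflare", key="ftp-user")
ftp_pass = dbutils.secrets.get(scope="cloudflare", key="ftp-password")

# 3. Push the file to the FTP server.
with ftplib.FTP("ftp.example.com", ftp_user, ftp_pass) as ftp:
    with open(csv_path, "rb") as f:
        ftp.storbinary("STOR access-logs.csv", f)
```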
I operate from the Netherlands, which puts my time zone at Central European Time (CET) in winter and Central European Summer Time (CEST) in summer. The data I handle is usually stored in UTC. Whenever I need to crunch some data with Spark, I struggle to do the right date conversion, especially around the switch between summer and winter time (do I need to add 1 or 2 hours?). In this blog, I’ll show how to handle these time zones properly in PySpark.
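PySpark already knows about daylight saving time through `from_utc_timestamp`, so you never have to choose between adding 1 or 2 hours yourself. A small illustration (the sample dates straddle the March 2021 clock change):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2021-03-27 12:00:00",),   # day before the DST switch (UTC+1)
     ("2021-03-29 12:00:00",)],  # day after the DST switch (UTC+2)
    ["utc_ts"],
).withColumn("utc_ts", F.to_timestamp("utc_ts"))

# from_utc_timestamp applies the right offset per date, so the
# summer/winter-time question is answered by the time zone database.
local = df.withColumn(
    "amsterdam_ts",
    F.from_utc_timestamp("utc_ts", "Europe/Amsterdam"),
)
local.show(truncate=False)
```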
To make a setup more resilient, we could allow actions to be retried when they fail. We should not “hammer” our underlying systems, so it is wise to wait a bit before retrying (exponential backoff). Let’s see how something like this could be done in Python. Note: this only works if actions are idempotent and you can afford to wait.
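A minimal sketch of such a retry helper as a decorator; the attempt counts and delays are arbitrary defaults, not values from the post:

```python
import functools
import random
import time


def retry(attempts=5, base_delay=1.0, factor=2.0):
    """Retry a function with exponential backoff; only safe for idempotent actions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: propagate the last error
                    # a little jitter avoids retrying in lock-step with other clients
                    time.sleep(delay + random.uniform(0, delay / 10))
                    delay *= factor
        return wrapper
    return decorator


@retry(attempts=3, base_delay=0.5)
def flaky_call():
    ...  # e.g. an HTTP request or a database write
```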
At Wehkamp we use Redis a lot. It is fast, highly available and runs as a managed AWS service called ElastiCache. Sometimes we need to extract data from Redis, and usually I use the redis-cli to interact from the command line. But what if you need to get the values of 400k+ keys? What would you do? Is there an efficient way to query multiple key/value pairs from Redis?
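One common approach is to combine `SCAN` (which walks the keyspace incrementally instead of blocking the server like `KEYS` does) with `MGET` in batches. Below is a sketch with the redis-py client; the endpoint and key pattern are hypothetical, and note that in a cluster-mode setup `MGET` only works for keys in the same hash slot.

```python
import redis

# Hypothetical ElastiCache endpoint; decode_responses turns bytes into strings.
r = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379, decode_responses=True)


def fetch_all(pattern="product:*", batch_size=1000):
    """Yield (key, value) pairs in batches instead of issuing 400k single GETs."""
    batch = []
    for key in r.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) == batch_size:
            yield from zip(batch, r.mget(batch))
            batch = []
    if batch:
        yield from zip(batch, r.mget(batch))


for key, value in fetch_all():
    print(key, value)
```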