Let’s face it: positional arguments like $1 and $2 are not very descriptive and not very flexible. There is a better way to supply arguments: with this simple trick you’ll get named arguments in your script, which is way better 🤓.
Yesterday we encountered a curious problem: we needed to use Spark to parse JSON data produced by AWS Kinesis. Unfortunately, the data was in a concatenated JSON format. Spark does not support that format, so we had to come up with something ourselves.
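One way to attack this (a sketch of the general idea, not necessarily the exact approach from the article): `json.JSONDecoder.raw_decode` can peel objects off a concatenated string one at a time. The S3 path is made up, and we assume each concatenated record sits on a single line.

```python
import json

def parse_concatenated(blob):
    """Split a concatenated-JSON string ('{..}{..}{..}') into a list of objects."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    blob = blob.strip()
    while pos < len(blob):
        obj, pos = decoder.raw_decode(blob, pos)  # parse one object, get its end index
        objects.append(obj)
        while pos < len(blob) and blob[pos].isspace():
            pos += 1  # skip whitespace between objects
    return objects

# Explode each raw record into individual JSON objects, then let Spark
# infer the schema from the re-serialized lines.
records = spark.sparkContext.textFile("s3://my-bucket/kinesis-dump/")
df = spark.read.json(records.flatMap(parse_concatenated).map(json.dumps))
```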
I love the SQL GROUP BY command, and sometimes I really miss such SQL-like functions in languages like JavaScript. In this article I explore how to group arrays into maps and how we can transform those maps into structures that work for us. We will leverage TypeScript generics to transform data structures in such a way that the end result remains strongly typed: this will make your code more maintainable.
While working in Databricks, I needed to plot some images. I had written code that does this in IPython notebooks, but nothing that works on a DataFrame. I decided to change the code a bit so it works in Databricks. This solution uses PIL and Matplotlib.
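A rough sketch of the approach, assuming the DataFrame has a binary column with raw image bytes (the column name `content` is an assumption, and `display` is the Databricks notebook helper):

```python
import io
import matplotlib.pyplot as plt
from PIL import Image

def show_images(df, content_col="content", n=4):
    """Plot the first n images from a binary image column in a single row."""
    rows = df.select(content_col).limit(n).collect()
    fig, axes = plt.subplots(1, len(rows), squeeze=False, figsize=(4 * len(rows), 4))
    for ax, row in zip(axes[0], rows):
        ax.imshow(Image.open(io.BytesIO(row[content_col])))  # decode bytes with PIL
        ax.axis("off")
    display(fig)  # Databricks renders Matplotlib figures via display()
```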
At Wehkamp we use decoration a lot. Decoration is a nice way of separating cross-cutting concerns from the actual code. Most of our repositories need the same set of decorators: exception logging, latency metrics and Jaeger spans. In this article I’ll be joining these three decorators into a single Swiss Army Knife decorator: one decorator to rule them all.
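The concept translates to a small Python sketch: implement each concern as its own decorator once, then stack them in a single combined decorator that you apply everywhere. The decorator bodies below are simplified stand-ins; the real ones would wire up your logging, metrics and Jaeger clients.

```python
import functools
import logging
import time

def log_exceptions(func):
    """Concern 1: log and re-raise exceptions."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.exception("error in %s", func.__name__)
            raise
    return wrapper

def time_latency(func):
    """Concern 2: record how long the call took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logging.info("%s took %.3fs", func.__name__, time.perf_counter() - start)
    return wrapper

def observed(func):
    """The Swiss Army Knife: stack the individual decorators in one place."""
    return log_exceptions(time_latency(func))  # a real version would add a Jaeger span too

@observed
def get_product(product_id):
    return {"id": product_id}
```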
Adding observability to microservices is vital if you want to discover bottlenecks. In this blog, I’ll show how we implemented Jaeger in .NET Core to observe incoming and outgoing requests. We’ll also use a Jaeger decorator to observe spans in classes.
Last week I was working on our new cockpit application, which is essentially a list of links to parts of our Wehkamp platform. The old application was not being maintained, as the React stack is not something that’s in the skill set of most engineers. We kept the new cockpit simple: plain old HTML. Of course we wanted to support a nice dark theme as well. This article shows how simple it is to implement dark mode.
Last week we wanted to parse some XML data with Spark. We had a column of unstructured XML documents and needed to extract text fields from them. This article shows how you can extract data in Spark using a UDF and ElementTree, the Python XML parser.
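The gist of the technique, as a sketch (the column name `payload` and the `<title>` element are invented for illustration):

```python
import xml.etree.ElementTree as ET
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def extract_title(xml_doc):
    """Pull the text of the first <title> element out of an XML document."""
    if xml_doc is None:
        return None
    try:
        node = ET.fromstring(xml_doc).find(".//title")
        return node.text if node is not None else None
    except ET.ParseError:
        return None  # unparsable documents yield null instead of failing the job

df = spark.createDataFrame([("<doc><title>Hello</title></doc>",)], ["payload"])
df.withColumn("title", extract_title("payload")).show()
```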
This week we’ve been working on processing the access logs from Cloudflare with Databricks (Spark). We now have a job that generates a huge CSV file (1 GB+) and sends it onwards over FTP for further processing with an external tool. Creating a DataFrame with the right data was easy. Now, let’s explore how to do a CSV export, secrets management and an FTP transfer!
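In outline the pipeline looks roughly like this; the output path, secret scope/key names and FTP host are all placeholders:

```python
import ftplib

# 1. Write the DataFrame as a single CSV file (coalesce(1) forces one part file).
(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("dbfs:/tmp/access-logs"))

# 2. Pull credentials from Databricks secrets (scope and key names are made up).
user = dbutils.secrets.get(scope="ftp", key="username")
password = dbutils.secrets.get(scope="ftp", key="password")

# 3. Locate the part file and push it over FTP.
part = [f for f in dbutils.fs.ls("dbfs:/tmp/access-logs") if f.name.endswith(".csv")][0]
with open(part.path.replace("dbfs:", "/dbfs"), "rb") as fh:
    with ftplib.FTP("ftp.example.com", user, password) as ftp:
        ftp.storbinary("STOR access-logs.csv", fh)
```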
I operate from the Netherlands, which puts my time zone at Central European Summer Time (CEST). The data I handle is usually stored in UTC. Whenever I need to crunch some data with Spark I struggle to get the date conversion right, especially around summer or winter time (do I need to add 1 or 2 hours?). In this blog, I’ll show how to handle these time zones properly in PySpark.
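A taste of the punch line: let Spark do the daylight-saving bookkeeping by converting with an Area/City time zone instead of adding a fixed offset yourself. A minimal sketch:

```python
from pyspark.sql import functions as F

# Keep the session clock in UTC so string-to-timestamp casts are unambiguous.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = (spark.createDataFrame([("2019-07-01 12:00:00",), ("2019-01-01 12:00:00",)], ["utc"])
      .withColumn("utc", F.col("utc").cast("timestamp")))

# from_utc_timestamp with an Area/City zone handles daylight saving for you:
# +2 hours in summer (CEST), +1 hour in winter (CET).
df.withColumn("amsterdam", F.from_utc_timestamp("utc", "Europe/Amsterdam")).show(truncate=False)
```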