Databricks / Spark
Databricks is a company founded by the original creators of Apache Spark. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark: replace array with IDs with values; or: how to join objects?
This week we’ve been looking at joining two huge tables in Spark into a single table. It turns out that it is not a straightforward exercise to join data based on an array of IDs. In this blog I’ll show one way of doing this.
Streaming a Kafka topic in a Delta table on S3 using Spark Structured Streaming
Our data strategy specifies that we should store data on S3 for further processing. Working with raw data on S3 is not the best way of dealing with data in Spark, though. In this blog I’ll show how you can use Spark Structured Streaming to write JSON records from a Kafka topic into a Delta table.
Easy Spark optimization for max record: aggregate instead of join?
There is a lot of code that needs to make a selection based on a maximum value. One example is reading from Kafka: we only want the record with the latest offset for each key, because that’s the latest record. What is the fastest way of doing this?
Add more color to the Python code of your Databricks notebook
Tired of the dull Python syntax highlighting in Databricks? Just copy this code into your Magic CSS editor, change it (to your own style), pin it & enjoy!
Kafka, Spark and schema inference
At Wehkamp we use Apache Kafka in our event-driven service architecture. It handles high loads of messages really well. We use Apache Spark to run analysis. From time to time, I need to read a Kafka topic into my Databricks notebook. In this article, I’ll show what I use to read from a Kafka topic that has no schema attached to it. We’ll also dive into how we can render the JSON schema in a human-readable format.
Simple Python code to send message to Slack channel (without packages)
Last week I was working on a Databricks script that needed to produce a Slack message as its final outcome. I lifted some code that used a pip-installed Slack client. Unfortunately, I could not use the package on my cluster. Fortunately, the Slack API is so simple that you don’t really need a package to post a simple message to a channel. In this blog I’ll show you the simplest way of producing awesome messages in Slack.
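To give an idea of how little code is needed, here is a minimal sketch that uses only the Python standard library. It assumes you post through a Slack incoming webhook; the webhook URL and the function names `build_slack_request` and `send_slack_message` are placeholders for illustration and are not taken from the original post, which may use a different Slack endpoint.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request that carries the message as a JSON body."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str) -> int:
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical usage; replace the URL with your own incoming webhook URL:
# send_slack_message("https://hooks.slack.com/services/EXAMPLE", "Job finished")
```

Building the request separately from sending it makes it possible to inspect the payload in a notebook cell before anything is posted to a channel.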
Caching resized images on S3 with Databricks
When you are training a machine learning image classification model, you often need to resize the images in your dataset into smaller ones. When you retrain your model on new data, you resize the images once more. In this blog I’ll share how S3 can be used to cache the resized images.
Sorting an array of a complex data type in Spark
Today we’ll be looking at sorting and reducing an array of a complex data type. I’m using Databricks to do Spark, but I’m sure the code is compatible with other Spark environments. I’ll be using Spark SQL to show the steps. I’ve tried to keep the data as simple as possible. The example should apply to scenarios that are more complex.
Adding True/False and list value widgets to your Databricks notebook
As an engineer, I love to parametrise my applications. That’s why I love the widget feature of Databricks notebooks, which allows me to do this with a nice UI. In this blog I’ll explore how to build a True/False widget and a list widget. I also show how to validate the values of required fields.
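Databricks widgets always return their value as a string, so a True/False widget or a list widget needs a small parsing and validation step. The sketch below shows one possible way to do that; the helper names `parse_bool_widget` and `parse_list_widget`, the widget names, and the comma-separated list format are assumptions for illustration, not necessarily what the post uses. The `dbutils` calls are commented out because they only work inside a Databricks notebook.

```python
def parse_bool_widget(value: str) -> bool:
    """Turn the string of a True/False dropdown widget into a bool."""
    if value not in ("True", "False"):
        raise ValueError(f"Expected 'True' or 'False', got {value!r}")
    return value == "True"


def parse_list_widget(value: str, required: bool = False) -> list:
    """Split a comma-separated widget value into a list of trimmed items."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    if required and not items:
        raise ValueError("This widget is required, but no values were given")
    return items


# Hypothetical usage inside a Databricks notebook:
# dbutils.widgets.dropdown("dry_run", "True", ["True", "False"], "Dry run")
# dbutils.widgets.text("countries", "", "Countries (comma-separated)")
# dry_run = parse_bool_widget(dbutils.widgets.get("dry_run"))
# countries = parse_list_widget(dbutils.widgets.get("countries"), required=True)
```

Raising a `ValueError` for a missing required value stops the notebook run early, before any expensive Spark work starts.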