Databricks is a company founded by the original creators of Apache Spark. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
There are 14 articles tagged with Databricks / Spark.
Yesterday we encountered a curious problem: we needed to use Spark to parse JSON data that was produced by AWS Kinesis. Unfortunately, the data was in a concatenated JSON format. Spark does not support that format, so we had to come up with something ourselves.
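As a rough sketch of one way to handle concatenated JSON (not necessarily the exact solution from the article), the snippet below splits each blob with Python's `json.JSONDecoder` and hands the individual records back to Spark; the S3 path is a placeholder.

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def split_concatenated_json(blob):
    """Yield every JSON object found in a concatenated JSON string."""
    decoder = json.JSONDecoder()
    blob = blob.strip()
    pos = 0
    while pos < len(blob):
        obj, pos = decoder.raw_decode(blob, pos)
        # skip whitespace between the concatenated objects
        while pos < len(blob) and blob[pos].isspace():
            pos += 1
        yield json.dumps(obj)


# "s3://my-bucket/kinesis/" is a placeholder path; wholeTextFiles keeps each
# blob intact so the decoder can walk through all objects it contains.
raw = spark.sparkContext.wholeTextFiles("s3://my-bucket/kinesis/")
records = raw.flatMap(lambda kv: split_concatenated_json(kv[1]))

df = spark.read.json(records)
df.show()
```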
While working in Databricks, I needed to plot some images. I had some code that does this in IPython notebooks, but nothing that works on a DataFrame. I decided to change the code a bit so that it works in Databricks. This solution uses PIL and Matplotlib.
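A minimal sketch of the idea (not the article's exact code), assuming the images are loaded with Spark's binaryFile reader; the mount path is a placeholder:

```python
import io

import matplotlib.pyplot as plt
from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "/mnt/images" is a placeholder; the binaryFile source gives us a `content`
# column holding the raw bytes of every file.
df = (spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/mnt/images"))

rows = df.select("path", "content").limit(9).collect()

fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for ax, row in zip(axes.flat, rows):
    img = Image.open(io.BytesIO(row["content"]))
    ax.imshow(img)
    ax.set_title(row["path"].split("/")[-1], fontsize=8)
    ax.axis("off")

# Databricks renders the active Matplotlib figure inline.
plt.show()
```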
Last week we wanted to parse some XML data with Spark. We have a column with unstructured XML documents and we need to extract text fields from it. This article shows how you can extract data in Spark using a UDF and ElementTree, the Python XML parser.
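A minimal sketch of that UDF-plus-ElementTree approach could look like this; the column name, the `<title>` element and the sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()


@udf(returnType=StringType())
def extract_title(xml):
    """Pull the text of the first <title> element out of an XML document."""
    if xml is None:
        return None
    try:
        root = ET.fromstring(xml)
        node = root.find(".//title")
        return node.text if node is not None else None
    except ET.ParseError:
        return None


documents = spark.createDataFrame(
    [("<book><title>Spark in Action</title></book>",)], ["xml_doc"])
documents.withColumn("title", extract_title("xml_doc")).show(truncate=False)
```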
This week we’ve been working on processing the access logs from Cloudflare with Databricks (Spark). We now have a job that generates a huge CSV file (over 1 GB) and sends it on by FTP for further processing with an external tool. Creating a DataFrame with the right data was easy. Now, let’s explore how to do a CSV export, secrets management and an FTP transfer!
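As a hedged sketch of those three steps (not the article's exact code), assuming the notebook runs on Databricks where `dbutils` and the `/dbfs` mount are available, and with placeholder table, path and credential names:

```python
import ftplib
import glob

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("access_logs")  # placeholder table name

# Write a single CSV file; coalesce(1) funnels everything through one task,
# which is acceptable here because the result is only a gigabyte or so.
(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", "true")
   .csv("/mnt/exports/access_logs"))

# Databricks secrets keep the FTP credentials out of the notebook;
# the scope and key names are placeholders.
password = dbutils.secrets.get(scope="ftp", key="password")  # dbutils is injected in Databricks notebooks

# Spark names the part file itself, so look it up via the /dbfs FUSE mount.
part_file = glob.glob("/dbfs/mnt/exports/access_logs/part-*.csv")[0]

with ftplib.FTP("ftp.example.com", "export-user", password) as ftp:
    with open(part_file, "rb") as fh:
        ftp.storbinary("STOR access_logs.csv", fh)
```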
I operate from the Netherlands and that makes my time zone Central European Summer Time (CEST). The data I handle is usually stored in UTC time. Whenever I need to crunch some data with Spark, I struggle to do the right date conversion, especially around summer or winter time (do I need to add 1 or 2 hours?). In this blog, I’ll show how to handle these time zones properly in PySpark.
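A minimal sketch of that conversion with `from_utc_timestamp`/`to_utc_timestamp` and the Europe/Amsterdam zone (the sample timestamps are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_utc_timestamp, to_utc_timestamp

spark = SparkSession.builder.getOrCreate()
# The source data is stored in UTC, so interpret string timestamps as UTC.
spark.conf.set("spark.sql.session.timeZone", "UTC")

# One winter and one summer timestamp, both in UTC.
df = spark.createDataFrame(
    [("2021-01-15 12:00:00",), ("2021-07-15 12:00:00",)], ["utc_ts"]
).withColumn("utc_ts", col("utc_ts").cast("timestamp"))

# Let Spark pick the right offset (+01:00 or +02:00) for the Amsterdam zone
# instead of hard-coding "add 1 or 2 hours".
df = df.withColumn("local_ts", from_utc_timestamp("utc_ts", "Europe/Amsterdam"))
df = df.withColumn("back_to_utc", to_utc_timestamp("local_ts", "Europe/Amsterdam"))
df.show(truncate=False)
```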
This week we’ve been looking at joining two huge tables in Spark. It turns out that joining data based on an array of IDs is not a straightforward exercise. In this blog I’ll show one way of doing it.
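One way of doing this (a sketch, not necessarily the article's exact solution) is to explode the ID array so a normal equi-join can take over; the tiny sample tables below are just for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Tiny stand-ins for the two "huge" tables; names are made up.
orders = spark.createDataFrame(
    [(1, ["a", "b"]), (2, ["c"])], ["order_id", "product_ids"])
products = spark.createDataFrame(
    [("a", "shoe"), ("b", "shirt"), ("c", "hat")], ["product_id", "name"])

# Explode the array into one row per ID so a plain equi-join can do the work.
exploded = orders.withColumn("product_id", explode("product_ids"))
joined = exploded.join(products, on="product_id", how="left")
joined.show()
```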
Our data strategy specifies that we should store data on S3 for further processing. Raw S3 data is not the best format for working with data in Spark, though. In this blog I’ll show how you can use Spark Structured Streaming to write JSON records from a Kafka topic into a Delta table.
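A minimal sketch of such a stream, with placeholder broker, topic, schema and paths:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The schema of the JSON records is a made-up example.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "my-topic")
          .option("startingOffsets", "earliest")
          .load())

# Kafka delivers bytes; cast the value to a string and parse the JSON.
parsed = (stream
          .select(from_json(col("value").cast("string"), schema).alias("json"))
          .select("json.*"))

query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/delta/my-topic/_checkpoint")
         .outputMode("append")
         .start("/mnt/delta/my-topic"))
```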
There is a lot of code that needs to make a selection based on a maximum value. One example is Kafka reads: we only want the latest offset for each key, because that’s the latest record. What is the fastest way of doing this?
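One common candidate is a window function that ranks the records per key; a small sketch (the article may well compare other approaches too):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Fake Kafka-style records: several offsets per key, we want only the latest.
records = spark.createDataFrame(
    [("k1", 1, "old"), ("k1", 5, "new"), ("k2", 3, "only")],
    ["key", "offset", "value"])

# Rank the records per key by offset (highest first) and keep rank 1.
w = Window.partitionBy("key").orderBy(col("offset").desc())
latest = (records
          .withColumn("rn", row_number().over(w))
          .where(col("rn") == 1)
          .drop("rn"))
latest.show()
```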
Tired of the dull Python syntax highlighting in Databricks? Just copy this code into your Magic CSS editor, change it (to your own style), pin it & enjoy!
At Wehkamp we use Apache Kafka in our event-driven service architecture. It handles high loads of messages really well. We use Apache Spark to run analyses. From time to time, I need to read a Kafka topic into my Databricks notebook. In this article, I’ll show what I use to read from a Kafka topic that has no schema attached to it. We’ll also dive into how we can render the JSON schema in a human-readable format.
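As a sketch of the general idea (placeholder broker and topic names, and not necessarily the article's exact code): read the topic as a batch, cast the value to a string, let Spark infer the JSON schema from a sample, and print it:

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

raw = (spark.read
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "earliest")
       .load())

# Kafka hands us bytes; cast the value to a string and let Spark infer the
# JSON schema from a sample of the messages.
values = raw.select(col("value").cast("string").alias("json"))
sample = spark.read.json(values.limit(1000).rdd.map(lambda r: r["json"]))

# printSchema gives the tree view; the pretty-printed schema JSON is handy
# when you want to paste it back into code later.
sample.printSchema()
print(json.dumps(json.loads(sample.schema.json()), indent=2))
```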