Our data strategy specifies that we should store data on S3 for further processing. Reading raw data straight from S3 is not the best way to work with data in Spark, though. In this blog I’ll show how you can use Spark Structured Streaming to write JSON records from a Kafka topic into a Delta table.
At Wehkamp we use Apache Kafka in our event-driven service architecture. It handles high message loads really well. We use Apache Spark to run analyses. From time to time, I need to read a Kafka topic into my Databricks notebook. In this article, I’ll show what I use to read from a Kafka topic that has no schema attached to it. We’ll also dive into how we can render the JSON schema in a human-readable format.
Last week I was working on a Databricks script that needed to produce a Slack message as its final outcome. I lifted some code that used a pip-installed Slack client. Unfortunately, I could not use the package on my cluster. Fortunately, the Slack API is so simple that you don’t really need a package to post a simple message to a channel. In this blog I’ll show you the simplest way of producing awesome messages in Slack.
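To give an idea of how little is needed, here is a minimal sketch that posts a message using only the Python standard library. It assumes you have a Slack incoming webhook; the helper names `build_slack_request` and `post_to_slack` and the URL in the usage comment are placeholders of mine, not something taken from the original script.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a simple text message as JSON.

    Assumes `webhook_url` is a Slack incoming webhook URL (hypothetical here).
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(webhook_url: str, text: str, timeout: float = 10.0) -> int:
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (placeholder URL, replace it with your own webhook):
# post_to_slack("https://hooks.slack.com/services/...", "Job finished")
```

Splitting the request building from the sending keeps the payload easy to inspect in a notebook before anything is actually posted.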
I like to validate my application configuration upon startup. Especially when doing local development, I want to know which application settings are missing. I also like to know where I should add them. This blog shows how to implement validation of your configuration classes using data annotations.
Today we’ll be looking at sorting and reducing an array of a complex data type. I’m using Databricks to run Spark, but I’m sure the code works on other Spark setups as well. I’ll be using Spark SQL to show the steps. I’ve tried to keep the data as simple as possible, but the example should also apply to more complex scenarios.