Databricks / Spark
Databricks is a company founded by the original creators of Apache Spark. It develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Caching resized images on S3 with Databricks
When you are training a machine learning image classification model, you often need to resize the images in your dataset to smaller dimensions. When you retrain your model on new data, you resize the images once more. In this post I’ll share how S3 can be used to cache the resized images.
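To make the caching idea a bit more concrete, here is a minimal sketch in plain Python. It is not the code from the post: the names `cache_key` and `get_resized` are hypothetical, the `store` dictionary merely stands in for an S3 bucket, and `resize` is whatever resizing function you already have. The point is only the pattern: derive a key from the source image and the target size, and resize only when that key is not in the cache yet.

```python
import hashlib
from typing import Callable, MutableMapping, Tuple

Size = Tuple[int, int]


def cache_key(source_key: str, size: Size, prefix: str = "resized") -> str:
    """Build a cache key from the source image key and the target size.

    The layout (prefix/WIDTHxHEIGHT/hash) is an assumption made for this sketch.
    """
    digest = hashlib.sha256(source_key.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}/{size[0]}x{size[1]}/{digest}"


def get_resized(
    source_key: str,
    size: Size,
    store: MutableMapping[str, bytes],
    resize: Callable[[str, Size], bytes],
) -> bytes:
    """Return the resized image, calling `resize` only on a cache miss.

    `store` is a dict-like stand-in for an S3 bucket (assumption).
    """
    key = cache_key(source_key, size)
    if key in store:
        # Cache hit: reuse the bytes that were stored earlier.
        return store[key]
    # Cache miss: resize once and remember the result for the next run.
    data = resize(source_key, size)
    store[key] = data
    return data
```

With a real bucket, the `key in store` check and the assignment would become a lookup and an upload against S3; the control flow stays the same.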
Sorting an array of a complex data type in Spark
Today we’ll be looking at sorting and reducing an array of a complex data type. I’m using Databricks to run Spark, but the code should work on other Spark setups as well. I’ll be using Spark SQL to show the steps. I’ve tried to keep the data as simple as possible, but the same approach should apply to more complex scenarios.
Adding True/False and list value widgets to your Databricks notebook
As an engineer, I love to parametrise my applications. That’s why I love the widget feature of Databricks notebooks, which allows me to do this with a nice UI. In this post I’ll explore how to build a True/False widget and a list widget. I’ll also show how to validate the values of required fields.
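Widget values come back from a notebook as strings, so a True/False widget and a list widget mostly come down to parsing and validating those strings. Below is a rough sketch of that parsing step in plain Python. The helper names (`parse_bool`, `parse_list`, `require`) and the widget names in the comments are hypothetical and not taken from the post; only `dbutils.widgets.dropdown`, `dbutils.widgets.text` and `dbutils.widgets.get` are actual Databricks calls, and they appear in comments so the snippet also runs outside a notebook.

```python
from typing import List

# In a Databricks notebook the widgets could be created like this
# (widget names are made up for this sketch):
#   dbutils.widgets.dropdown("overwrite", "False", ["True", "False"])
#   dbutils.widgets.text("countries", "")
# and read back as strings with dbutils.widgets.get("overwrite").


def require(name: str, value: str) -> str:
    """Fail early when a required widget was left empty."""
    if not value.strip():
        raise ValueError(f"Widget '{name}' is required but empty")
    return value


def parse_bool(value: str) -> bool:
    """Turn the string of a True/False widget into a real boolean."""
    normalized = value.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    raise ValueError(f"Expected 'True' or 'False', got '{value}'")


def parse_list(value: str, sep: str = ",") -> List[str]:
    """Split a separator-delimited widget value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(sep) if item.strip()]
```

In a notebook you would then write something like `parse_bool(dbutils.widgets.get("overwrite"))`. Note that `bool("False")` is `True` in Python, which is why the explicit comparison is used instead.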