Amazon S3

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network.

Amazon S3 can be employed to store any type of object which allows for uses like storage for Internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.

Amazon S3, bash

Harmonizing Team tags on AWS S3 Buckets

At Wehkamp we use many – many – buckets! To do FinOps correctly, it is important we’re able to determine which teams own which buckets. In this article I’ll discuss how to detect Team tags that are not correct and apply the correct ones. We’re using a combination of Bash, AWS CLI, CSV and JQ.

Amazon S3, Automation, Node.js, Scripting

Using the S3P API to copy 1.3M of 5M of AWS S3 keys

This week we had to exfil some data out of a bucket with 5M+ of keys. After doing some calculations and testing with a Bash script that used AWS cli, we decided to go a more performant route and use s3p. They claim to be 5-50 times faster than AWS cli 😊.

Amazon S3, Databricks / Spark, Kafka, PySpark

Streaming a Kafka topic in a Delta table on S3 using Spark Structured Streaming

Our data strategy specifies that we should store data on S3 for further processing. Raw S3 data is not the best way of dealing with data on Spark, though. In this blog I’ll show how you can use Spark Structured Streaming to write JSON records of a Kafka topic into a Delta table.

Amazon S3, Databricks / Spark, PySpark

Caching resized images on S3 with Databricks

When you are training a machine learning image classification model, you often need to resize the images your dataset into smaller ones. When you retrain your model on new data, you resize the images once more. In this blog I’ll share how S3 can be used to cache the resized images.

Amazon S3, Automation, AWS, Python

Trigger Lambda for large S3 Bucket with SQS

At Wehkamp we use AWS Lambda to classify images on S3. The Lambda is triggered when a new image is uploaded to the S3 bucket. Currently we have over 6.400.000 images in the bucket. Now we would like to run the Lambda for all images of the bucket. In this blog I’ll show how we did this with a Python 3.6 script.

Amazon S3, AWS Lambda, bash, Python

AWS Lambda Size: PIL+TF+Keras+Numpy?

At Wehkamp we’ve been using machine learning for a while now. We’re training models in Databricks (Spark) and Keras. This produces a Keras file that we use to make the actual predictions. Training is one thing, but getting them to production is quite another!

The main problem we’ve faced was that it was too big to actually fit into a lambda. This blogs shows how we’ve dealt with that problem.