Task: take some website screenshots and put them in a folder (in parallel)

There are plenty of online services that take screenshots of a site and save them in a folder. While paying for such a service can be very useful, it is not that hard to build one yourself. Let's use Chrome / Chromium with Puppeteer and Node.js to take some snapshots in no time. We'll use the Puppeteer Cluster package to run multiple workers that grab those screenshots in parallel. We'll be using TypeScript.

Packages

Let's install the following packages:

npm install puppeteer
npm install puppeteer-cluster
npm install sanitize-filename
npm install tmp
npm install -D @types/tmp

Define the inputs

The Puppeteer Cluster API has many options and building on it is super easy. But I want something even simpler and more reusable: let's take a bunch of URLs and just save them in a directory as fast as possible.

Let's define a list of URLs we would like to screenshot:

const URLS: Record<string, string> = {
  Homepage: "https://www.wehkamp.nl/",
  Dames: "https://www.wehkamp.nl/damesmode/C21/",
  Heren: "https://www.wehkamp.nl/herenmode/C22/",
  Kinderen: "https://www.wehkamp.nl/kinderen/C23/",
  Baby: "https://www.wehkamp.nl/baby/C50/",
  Beauty: "https://www.wehkamp.nl/mooi-gezond/C29/",
  "Wonen & slapen": "https://www.wehkamp.nl/wonen-slapen/C28/",
  "Koken & tafelen": "https://www.wehkamp.nl/koken-tafelen/C51/",
  Tuin: "https://www.wehkamp.nl/tuin-klussen/C30/",
  Huishouden: "https://www.wehkamp.nl/huishouden/C27/",
  Elektronica: "https://www.wehkamp.nl/elektronica/C26/",
  Speelgoed: "https://www.wehkamp.nl/speelgoed-games/C25/",
  "Sport & vrije tijd": "https://www.wehkamp.nl/sport-vrije-tijd/C24/",
  "Boeken, films & muziek": "https://www.wehkamp.nl/boeken-films-muziek/C31/",
  Cadeaus: "https://www.wehkamp.nl/cadeaushop/C52/",
  Sale: "https://www.wehkamp.nl/sale/OPR/",
  Actie: "https://www.wehkamp.nl/actie/C92/",
  Kleding: "https://www.wehkamp.nl/kleding/C80/",
}

Next, let's define the type we need to process a single URL. We'll call it ScreenshotTask:

import { Page, ScreenshotOptions } from "puppeteer"

export type ScreenshotTask = {
  name?: string
  url: string
  width?: number
  height?: number
  screenshotOptions: ScreenshotOptions
  onProcessed?: (settings: ScreenshotTask) => Promise<void>
  onBeforeScreenshot?: (settings: ScreenshotTask, page: Page) => Promise<void>
}

We've borrowed ScreenshotOptions from the Puppeteer project. The width and height change the viewport, which is handy for mobile screenshots. If you set the screenshotOptions.path property, Puppeteer writes the screenshot to that path.
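
To make the viewport fields concrete, here is a small sketch of a task for a phone-sized screenshot. The ScreenshotOptionsLike type is a stand-in so the snippet runs without Puppeteer installed (in the real code you would use ScreenshotOptions), and the mobileTask helper, the 390x844 viewport and the "screenshots" folder are illustrative assumptions.

```typescript
import { join } from "node:path"

// Stand-in for the few Puppeteer ScreenshotOptions fields used here (assumption).
type ScreenshotOptionsLike = {
  path?: string
  type?: "png" | "jpeg"
  fullPage?: boolean
}

// Trimmed-down copy of the ScreenshotTask type, without the callbacks.
type MobileScreenshotTask = {
  name?: string
  url: string
  width?: number
  height?: number
  screenshotOptions: ScreenshotOptionsLike
}

// Hypothetical helper: builds a task that renders the page in a phone-sized viewport.
function mobileTask(
  taskName: string,
  url: string,
  folder: string
): MobileScreenshotTask {
  return {
    name: taskName,
    url,
    width: 390, // assumed phone width in CSS pixels
    height: 844, // assumed phone height in CSS pixels
    screenshotOptions: {
      fullPage: true,
      type: "png",
      path: join(folder, `${taskName}-mobile.png`),
    },
  }
}

const mobileHomepage = mobileTask(
  "Homepage",
  "https://www.wehkamp.nl/",
  "screenshots"
)
console.log(mobileHomepage.screenshotOptions.path)
```

Because width or height is set, the processor will call setViewport before visiting the URL.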

Next, let's define similar settings for processing all the URLs at once:

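// Passed to onProcessedItem once a screenshot has been written
export type ProcessedScreenshot = {
  name: string
  url: string
  path: string
}

export type ProcessorOptions = {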
  onProcessedItem?: (task: ProcessedScreenshot) => Promise<void>
  onBeforeScreenshot?: (settings: ScreenshotTask, page: Page) => Promise<void>
  screenshotOptions?: ScreenshotOptions
  maxConcurrency?: number
  width?: number
  height?: number
  folder?: string
}

Now that we have our inputs defined, we can create a class that uses the inputs to create a single screenshot.

Taking a single screenshot

We want to take multiple screenshots in parallel, so we should be able to queue a single screenshot. The class below controls the cluster flow (starting, stopping and waiting for everything to finish). We're aiming for an interaction that is simple for the end user, so I would like to abstract things away:

import { Cluster } from "puppeteer-cluster"

export class ScreenshotProcessor {
  // null until start() has launched the cluster
  private cluster: Cluster<ScreenshotTask, void> | null = null

  constructor(private maxConcurrency: number) {}

  // Launches the cluster on first use and returns it
  public async start() {
    if (!this.cluster) {
      this.cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: this.maxConcurrency,
      })
    }
    return this.cluster
  }

  // Waits until the queue is drained, then closes the cluster
  public async stop() {
    if (this.cluster) {
      await this.cluster.idle()
      await this.cluster.close()
      this.cluster = null
    }
  }

  public async queue(task: ScreenshotTask) {
    const cluster = await this.start()
    await cluster.queue(task, async task => {
      if (task.data.height || task.data.width) {
        await task.page.setViewport({
          width: task.data.width || 800,
          height: task.data.height || 600,
        })
      }

      await task.page.goto(task.data.url)
      
      if (task.data.onBeforeScreenshot) {
        await task.data.onBeforeScreenshot(task.data, task.page);
      }

      try {
        await task.page.screenshot(task.data.screenshotOptions)
      } catch (ex) {
        console.error(
          `Error while taking screenshot\nData: ${JSON.stringify(
            task.data
          )}\n${ex}`
        )
        throw ex
      }

      if (task.data.onProcessed) {
        await task.data.onProcessed(task.data)
      }
    })
  }
}

The steps to take the screenshot are:

  1. Make sure the viewport is set.
  2. Visit the URL.
  3. Run the onBeforeScreenshot hook, if one is given.
  4. Take a screenshot.
  5. Execute the callback if everything went OK.
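
The same order can be sketched against a tiny page interface, which makes it easy to check the flow without launching a browser. PageLike, StepTask, takeScreenshot and the recording fake page are hypothetical stand-ins, not part of Puppeteer. The sketch also includes the optional onBeforeScreenshot hook that the class runs between visiting the URL and taking the screenshot.

```typescript
// Minimal stand-in for the parts of Puppeteer's Page used by the processor (assumption).
type PageLike = {
  setViewport(viewport: { width: number; height: number }): Promise<void>
  goto(url: string): Promise<unknown>
  screenshot(options: { path?: string }): Promise<unknown>
}

type StepTask = {
  url: string
  width?: number
  height?: number
  screenshotOptions: { path?: string }
  onBeforeScreenshot?: (task: StepTask, page: PageLike) => Promise<void>
  onProcessed?: (task: StepTask) => Promise<void>
}

// Hypothetical helper that mirrors the order of the steps.
async function takeScreenshot(page: PageLike, task: StepTask): Promise<void> {
  // Only touch the viewport when a size was requested.
  if (task.width || task.height) {
    await page.setViewport({
      width: task.width || 800,
      height: task.height || 600,
    })
  }
  // Visit the URL.
  await page.goto(task.url)
  // Optional hook; what it does is up to the caller.
  if (task.onBeforeScreenshot) await task.onBeforeScreenshot(task, page)
  // Take the screenshot; an error here skips the callback below.
  await page.screenshot(task.screenshotOptions)
  // Report success.
  if (task.onProcessed) await task.onProcessed(task)
}

// A fake page that records calls instead of driving a browser.
const recordedCalls: string[] = []
const fakePage: PageLike = {
  setViewport: async v => {
    recordedCalls.push(`viewport ${v.width}x${v.height}`)
  },
  goto: async url => {
    recordedCalls.push(`goto ${url}`)
  },
  screenshot: async o => {
    recordedCalls.push(`screenshot ${o.path}`)
  },
}

const stepsDone = takeScreenshot(fakePage, {
  url: "https://www.wehkamp.nl/",
  width: 375,
  screenshotOptions: { path: "Homepage.png" },
  onProcessed: async () => {
    recordedCalls.push("processed")
  },
}).then(() => console.log(recordedCalls.join("\n")))
```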

We could now take multiple screenshots by executing:

;(async () => {
  let processor = new ScreenshotProcessor(8)

  await processor.start()

  // queue() only enqueues the work, so awaiting it does not serialize the screenshots
  for (const [name, url] of Object.entries(URLS)) {
    await processor.queue({
      name,
      url,
      screenshotOptions: {
        fullPage: true,
        type: "png",
        path: `c:\\temp\\test\\${name}.png`,
      },
      onProcessed: async task =>
        console.log(
          `${task.name} is now saved at: ${task.screenshotOptions.path}`
        ),
    })
  }

  await processor.stop()
  console.log("Finished")
})()

This works fine, provided the folder of each path already exists.
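
If you cannot be sure the folder exists, it can be created up front. Below is a minimal sketch using Node's fs.mkdir with recursive: true. The helper name ensureParentFolder and the temp-folder path in the usage lines are assumptions for illustration.

```typescript
import { mkdir, stat } from "node:fs/promises"
import { tmpdir } from "node:os"
import { dirname, join } from "node:path"

// Hypothetical helper: creates the folder a screenshot will be written to.
// With recursive: true, mkdir does not fail when the folder already exists.
async function ensureParentFolder(screenshotPath: string): Promise<string> {
  const folder = dirname(screenshotPath)
  await mkdir(folder, { recursive: true })
  return folder
}

// Example path inside the OS temp folder (an assumption for this sketch).
const targetPath = join(tmpdir(), "screenshot-test", "Homepage.png")

const ensuredFolder = ensureParentFolder(targetPath)
ensuredFolder
  .then(folder => stat(folder))
  .then(info => console.log(`Folder ready: ${info.isDirectory()}`))
```

Calling the helper once per screenshot path is safe, because repeated calls for the same folder do nothing.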

Let's make it even easier

The screenshot settings will not change that often, so let's export them as constants:

export const FullPagePngScreenshots = Object.freeze(<ScreenshotOptions>{
  fullPage: true,
  type: "png",
})

export const FullPageJpgScreenshots = Object.freeze(<ScreenshotOptions>{
  fullPage: true,
  type: "jpeg",
  quality: 80,
})

This way, people who are new to our code don't have to understand the Puppeteer docs before they get started.
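
These presets also decide the file extension later on. Here is a small sketch of how a file name could be derived from a preset, falling back to Puppeteer's default type (png) when type is not set. PresetLike and fileNameFor are hypothetical names, and the type is a stand-in so the snippet runs without Puppeteer.

```typescript
// Stand-in for the ScreenshotOptions fields used by the presets (assumption).
type PresetLike = Readonly<{
  fullPage?: boolean
  type?: "png" | "jpeg"
  quality?: number
}>

const pngPreset: PresetLike = { fullPage: true, type: "png" }
const jpgPreset: PresetLike = { fullPage: true, type: "jpeg", quality: 80 }

// Hypothetical helper: picks the extension from the preset's type.
function fileNameFor(baseName: string, preset: PresetLike): string {
  const extension = preset.type ?? "png"
  return `${baseName}.${extension}`
}

console.log(fileNameFor("Homepage", pngPreset)) // Homepage.png
console.log(fileNameFor("Homepage", jpgPreset)) // Homepage.jpeg
console.log(fileNameFor("Homepage", {})) // Homepage.png (fallback)
```

Note that the JPG preset produces a .jpeg extension, because the extension is taken straight from the type value.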

Making sure that the destination folder is actually present is something we can automate! Let's add the following static method to our ScreenshotProcessor:

import sanitize from "sanitize-filename"
import tmp from "tmp"
import fs from "node:fs/promises"
import path from "node:path"
import { ScreenshotOptions } from "puppeteer"

public static async run(
  urls: Record<string, string>,
  options: ProcessorOptions
) {
  // start the processor
  let processor = new ScreenshotProcessor(options.maxConcurrency || 8)
  await processor.start()

  // ensure folder path
  let tmpFolder =
    options.folder || tmp.dirSync({ postfix: "screenshots" }).name
  tmpFolder = path.resolve(tmpFolder)
  await fs.mkdir(tmpFolder, { recursive: true })

  // queue() only enqueues the task, so awaiting it keeps the screenshots parallel
  for (const [name, url] of Object.entries(urls)) {
    let screenshotOptions = Object.assign(
      {},
      options.screenshotOptions || FullPageJpgScreenshots
    ) as ScreenshotOptions

    // Puppeteer defaults to PNG when no type is set
    let tempFilePath = path.join(
      tmpFolder,
      `${sanitize(name)}.${screenshotOptions.type || "png"}`
    )
    screenshotOptions.path = tempFilePath

    let task: ScreenshotTask = {
      name,
      url: url,
      onBeforeScreenshot: options.onBeforeScreenshot,
      width: options.width,
      height: options.height,
      screenshotOptions: screenshotOptions,
      onProcessed: async () => {
        if (options.onProcessedItem) {
          await options.onProcessedItem({
            name: name,
            url: url,
            path: tempFilePath,
          })
        }
      },
    }
    await processor.queue(task)
  }

  await processor.stop()
}

This method takes over the burden of starting and stopping the processor and of creating a temporary folder. Our example code does the same as before, but is now way smaller:

;(async () => {
  await ScreenshotProcessor.run(URLS, {
    onProcessedItem: async task => {
      console.log(`${task.name} is now saved at: ${task.path}`)
    },
  })
  console.log("Finished")
})()
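
The maxConcurrency option caps how many screenshots are in flight at once. The idea can be sketched without Puppeteer as a small worker pool. This is not how Puppeteer Cluster is implemented, only an illustration of the concept, and runWithLimit and the fake jobs are hypothetical.

```typescript
// Hypothetical helper: runs async jobs with at most `maxConcurrency` in flight.
async function runWithLimit<T>(
  jobs: Array<() => Promise<T>>,
  maxConcurrency: number
): Promise<T[]> {
  const results: T[] = new Array<T>(jobs.length)
  let next = 0

  // Each worker keeps claiming the next unclaimed job until none are left.
  async function worker(): Promise<void> {
    while (next < jobs.length) {
      const index = next++
      results[index] = await jobs[index]()
    }
  }

  const workerCount = Math.max(1, Math.min(maxConcurrency, jobs.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}

// Usage: six fake "screenshots" that each take 20 ms, two at a time.
let activeJobs = 0
let peakJobs = 0
const fakeJobs = Array.from({ length: 6 }, (_, i) => async () => {
  activeJobs++
  peakJobs = Math.max(peakJobs, activeJobs)
  await new Promise<void>(resolve => setTimeout(resolve, 20))
  activeJobs--
  return `shot-${i}`
})

const poolFinished = runWithLimit(fakeJobs, 2).then(shots => {
  console.log(`${shots.length} done, peak concurrency: ${peakJobs}`)
  return shots
})
```

A higher limit only helps as long as the machine and the target site can keep up, which is what the next section tries to measure.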

So what about concurrency?

I used hyperfine to benchmark the screenshotting of 18 files and got these results for JPG files:

Command                      Mean [s]         Min [s]   Max [s]   Relative
node dist/index.js jpg 1     5.461 ± 3.985    2.957     12.848    1.75 ± 1.27
node dist/index.js jpg 2     3.129 ± 0.084    3.016     3.297     1.00
node dist/index.js jpg 4     3.924 ± 2.549    2.990     11.175    1.25 ± 0.82
node dist/index.js jpg 8     5.149 ± 3.310    3.041     11.043    1.65 ± 1.06
node dist/index.js jpg 10    3.284 ± 0.114    3.080     3.446     1.05 ± 0.05
node dist/index.js jpg 12    5.795 ± 3.274    3.112     12.038    1.85 ± 1.05
node dist/index.js jpg 14    3.195 ± 0.108    3.016     3.390     1.02 ± 0.04
node dist/index.js jpg 16    4.425 ± 2.659    3.071     9.891     1.41 ± 0.85
Benchmark for JPG files.

And these for PNG files:

Command                      Mean [s]         Min [s]   Max [s]   Relative
node dist/index.js png 1     3.155 ± 0.082    3.004     3.268     1.00
node dist/index.js png 2     6.356 ± 5.979    3.070     18.098    2.01 ± 1.90
node dist/index.js png 4     4.360 ± 2.474    3.452     11.396    1.38 ± 0.78
node dist/index.js png 8     6.090 ± 4.880    3.045     16.610    1.93 ± 1.55
node dist/index.js png 10    3.219 ± 0.142    3.062     3.535     1.02 ± 0.05
node dist/index.js png 12    4.990 ± 3.865    3.067     12.555    1.58 ± 1.23
node dist/index.js png 14    7.335 ± 9.091    3.039     31.930    2.32 ± 2.88
node dist/index.js png 16    7.195 ± 5.217    3.078     17.075    2.28 ± 1.65
Benchmark for PNG files.

From these stats I cannot draw any conclusions. Hyperfine tells me we have some statistical outliers, but my gut says that Cloudflare is blocking my access. To be continued...
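
One way to make the outliers visible is to compare each run's standard deviation with its mean (the coefficient of variation). Below is a rough sketch over the JPG numbers above. The 10% threshold and the names BenchmarkRun and isNoisy are arbitrary assumptions, not something hyperfine reports.

```typescript
// Mean and standard deviation in seconds, copied from the JPG benchmark.
// `setting` is the last argument of the benchmarked command.
type BenchmarkRun = { setting: number; mean: number; stdDev: number }

const jpgRuns: BenchmarkRun[] = [
  { setting: 1, mean: 5.461, stdDev: 3.985 },
  { setting: 2, mean: 3.129, stdDev: 0.084 },
  { setting: 4, mean: 3.924, stdDev: 2.549 },
  { setting: 8, mean: 5.149, stdDev: 3.31 },
  { setting: 10, mean: 3.284, stdDev: 0.114 },
  { setting: 12, mean: 5.795, stdDev: 3.274 },
  { setting: 14, mean: 3.195, stdDev: 0.108 },
  { setting: 16, mean: 4.425, stdDev: 2.659 },
]

// Assumption: a run is "noisy" when its standard deviation exceeds 10% of its mean.
function isNoisy(run: BenchmarkRun, threshold = 0.1): boolean {
  return run.stdDev / run.mean > threshold
}

const stableSettings = jpgRuns
  .filter(run => !isNoisy(run))
  .map(run => run.setting)
const noisySettings = jpgRuns
  .filter(run => isNoisy(run))
  .map(run => run.setting)

console.log("stable:", stableSettings.join(", "))
console.log("noisy:", noisySettings.join(", "))
```

With this threshold, the stable runs all finish in roughly 3.1 to 3.3 seconds, which is at least consistent with the suspicion that something outside the script dominates the noisy runs.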
