# Task: take some website screenshots and put them in a folder (in parallel)

**Date:** 2022-07-30  
**Author:** Kees C. Bakker  
**Categories:** Node.js, TypeScript  
**Original:** https://keestalkstech.com/task-take-some-website-screenshots-and-put-them-in-a-folder-in-parallel/

![Task: take some website screenshots and put them in a folder (in parallel)](https://keestalkstech.com/wp-content/uploads/2022/07/patrick-dozkVhDyvhQ-unsplash.jpg)

---

Plenty of online services let you take screenshots of a site and save them in a folder. While it can be very useful to pay for such a service, it is not that hard to build one yourself. Let's use Chrome / Chromium with [Puppeteer](https://pptr.dev/) and Node.js to take some snapshots in no time. We'll use the [Puppeteer Cluster](https://www.npmjs.com/package/puppeteer-cluster) package to run multiple workers that grab those screens in *parallel*. We'll be using TypeScript.

## Packages

Let's install the following packages:

```sh
npm install puppeteer
npm install puppeteer-cluster
npm install sanitize-filename
npm install tmp
npm install -D @types/tmp
```

## Define the inputs

The Puppeteer Cluster API has many options and building on it is super easy. But I want something that is even simpler and reusable: *let's take a bunch of URLs and just save them in a directory as fast as possible*.

Let's define a list of URLs we would like to screenshot:

```ts
const URLS: Record<string, string> = {
  Homepage: "https://www.wehkamp.nl/",
  Dames: "https://www.wehkamp.nl/damesmode/C21/",
  Heren: "https://www.wehkamp.nl/herenmode/C22/",
  Kinderen: "https://www.wehkamp.nl/kinderen/C23/",
  Baby: "https://www.wehkamp.nl/baby/C50/",
  Beauty: "https://www.wehkamp.nl/mooi-gezond/C29/",
  "Wonen & slapen": "https://www.wehkamp.nl/wonen-slapen/C28/",
  "Koken & tafelen": "https://www.wehkamp.nl/koken-tafelen/C51/",
  Tuin: "https://www.wehkamp.nl/tuin-klussen/C30/",
  Huishouden: "https://www.wehkamp.nl/huishouden/C27/",
  Elektronica: "https://www.wehkamp.nl/elektronica/C26/",
  Speelgoed: "https://www.wehkamp.nl/speelgoed-games/C25/",
  "Sport & vrije tijd": "https://www.wehkamp.nl/sport-vrije-tijd/C24/",
  "Boeken, films & muziek": "https://www.wehkamp.nl/boeken-films-muziek/C31/",
  Cadeaus: "https://www.wehkamp.nl/cadeaushop/C52/",
  Sale: "https://www.wehkamp.nl/sale/OPR/",
  Actie: "https://www.wehkamp.nl/actie/C92/",
  Kleding: "https://www.wehkamp.nl/kleding/C80/",
}
```

Next, let's define the type we need if we want to process a *single* URL and call that type our `ScreenshotTask`:

```ts
import { Page, ScreenshotOptions } from "puppeteer"

export type ScreenshotTask = {
  name?: string
  url: string
  width?: number
  height?: number
  screenshotOptions: ScreenshotOptions
  onProcessed?: (settings: ScreenshotTask) => Promise<void>
  onBeforeScreenshot?: (settings: ScreenshotTask, page: Page) => Promise<void>
}
```

We've borrowed `ScreenshotOptions` from the [Puppeteer project](https://github.com/puppeteer/puppeteer/blob/main/website/versioned_docs/version-15.5.0/api/puppeteer.screenshotoptions.md). We'll use `width` and `height` to change the viewport (useful for mobile screenshots). If you set the `screenshotOptions.path` property, Puppeteer will write the screenshot to that path.
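
To make the viewport fields concrete, here is a minimal sketch of a task aimed at a phone-sized screenshot. The `mobileTask` helper, the `TaskSketch` stand-in type and the 390×844 size are assumptions made for illustration; they are not part of Puppeteer or of the code in this post.

```typescript
// Structural stand-in for ScreenshotTask, reduced to the fields used here,
// so the sketch runs without Puppeteer installed.
type TaskSketch = {
  name?: string
  url: string
  width?: number
  height?: number
  screenshotOptions: { path?: string; fullPage?: boolean; type?: "png" | "jpeg" }
}

// Hypothetical helper: builds a task that renders the page in a narrow viewport.
// 390x844 is an assumed phone-like size, not a value from the article.
function mobileTask(name: string, url: string, folder: string): TaskSketch {
  return {
    name,
    url,
    width: 390,
    height: 844,
    screenshotOptions: {
      fullPage: true,
      type: "png",
      path: `${folder}/${name}-mobile.png`,
    },
  }
}

const homepageTask = mobileTask("Homepage", "https://www.wehkamp.nl/", "./screenshots")
console.log(homepageTask.screenshotOptions.path) // → ./screenshots/Homepage-mobile.png
```

Note that this only resizes the viewport; a site that decides on its mobile layout by user agent would need more than a width and a height.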

Next, let's define similar settings for processing *all* the URLs:

```ts
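// what `onProcessedItem` receives for every saved screenshot
export type ProcessedScreenshot = {
  name: string
  url: string
  path: string
}

export type ProcessorOptions = {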
  onProcessedItem?: (task: ProcessedScreenshot) => Promise<void>
  onBeforeScreenshot?: (settings: ScreenshotTask, page: Page) => Promise<void>
  screenshotOptions?: ScreenshotOptions
  maxConcurrency?: number
  width?: number
  height?: number
  folder?: string
}
```

Now that we have our inputs defined, we can create a class that uses the inputs to create a single screenshot.

## Taking a single screenshot

We want to take *multiple* screenshots in parallel, so we should be able to queue a *single* screenshot. The class below controls the cluster flow (starting, stopping and waiting for everything to finish). We're trying to create an interaction that is simple for the end user, so I would like to abstract things away:

```ts
import { Cluster } from "puppeteer-cluster"

export class ScreenshotProcessor {
  private cluster: Cluster<ScreenshotTask, void>

  constructor(private maxConcurrency: number) {}

  public async start() {
    if (!this.cluster) {
      this.cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: this.maxConcurrency,
      })
    }
  }

  public async stop() {
    if (this.cluster) {
      await this.cluster.idle()
      await this.cluster.close()
      this.cluster = null
    }
  }

  public async queue(task: ScreenshotTask) {
    await this.start()
    await this.cluster.queue(task, async task => {
      if (task.data.height || task.data.width) {
        await task.page.setViewport({
          width: task.data.width || 800,
          height: task.data.height || 600,
        })
      }

      await task.page.goto(task.data.url)
      
      if (task.data.onBeforeScreenshot) {
        await task.data.onBeforeScreenshot(task.data, task.page);
      }

      try {
        await task.page.screenshot(task.data.screenshotOptions)
      } catch (ex) {
        console.error(
          `Error while taking screenshot\nData: ${JSON.stringify(
            task.data
          )}\n${ex}`
        )
        throw ex
      }

      if (task.data.onProcessed) {
        await task.data.onProcessed(task.data)
      }
    })
  }
}
```
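
One detail in `queue` that is easy to miss is the viewport handling: the viewport is only touched when at least one of `width` or `height` is set, and the missing dimension then falls back to 800 or 600. The sketch below isolates that logic in a hypothetical `resolveViewport` function (it is not part of the class above) so the behavior can be checked without starting a browser.

```typescript
type Viewport = { width: number; height: number }

// Mirrors the viewport logic in `queue`: no override when both values are
// missing, otherwise the missing dimension falls back to 800 or 600.
function resolveViewport(width?: number, height?: number): Viewport | undefined {
  if (!height && !width) return undefined
  return { width: width || 800, height: height || 600 }
}

console.log(resolveViewport())                // → undefined
console.log(resolveViewport(390))             // → { width: 390, height: 600 }
console.log(resolveViewport(undefined, 1080)) // → { width: 800, height: 1080 }
```

Because the fallback uses `||`, a value of `0` is treated the same as a missing value.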

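The steps to take the screenshot are:

1. Set the viewport, if a `width` or `height` is specified.
2. Navigate to the URL.
3. Call `onBeforeScreenshot`, if specified.
4. Take the screenshot (and log the task data if that fails).
5. Call `onProcessed`, if specified.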

We could now take multiple screenshots by executing:

```ts
;(async () => {
  let processor = new ScreenshotProcessor(8)

  await processor.start()

  // await every queue call, so all tasks are registered
  // before stop() waits for the cluster to go idle
  for (const [name, url] of Object.entries(URLS)) {
    await processor.queue({
      name,
      url,
      screenshotOptions: {
        fullPage: true,
        type: "png",
        path: `c:\\temp\\test\\${name}.png`,
      },
      onProcessed: async task =>
        console.log(
          `${task.name} is now saved at: ${task.screenshotOptions.path}`
        ),
    })
  }

  await processor.stop()
  console.log("Finished")
})()
```

This works fine, provided the folder of each `path` already exists.

## Let's make it even easier

The screenshot settings will not change that often, so let's export them as constants:

```ts
export const FullPagePngScreenshots = Object.freeze(<ScreenshotOptions>{
  fullPage: true,
  type: "png",
})

export const FullPageJpgScreenshots = Object.freeze(<ScreenshotOptions>{
  fullPage: true,
  type: "jpeg",
  quality: 80,
})
```

This way, new people that interact with our code don't have to understand the Puppeteer docs before they get started.
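
Because these presets are frozen and shared, each task needs its own copy before a `path` is added. Here is a minimal sketch of that idea; `OptionsSketch` is a stand-in for Puppeteer's `ScreenshotOptions` and `withPath` is a hypothetical helper, both assumptions made so the example runs on its own.

```typescript
// Stand-in for Puppeteer's ScreenshotOptions, limited to the fields used here.
type OptionsSketch = {
  fullPage?: boolean
  type?: "png" | "jpeg"
  quality?: number
  path?: string
}

const jpgBase: OptionsSketch = { fullPage: true, type: "jpeg", quality: 80 }
const FullPageJpgPreset = Object.freeze(jpgBase)

// Hypothetical helper: copies the frozen preset and adds a path to the copy.
function withPath(preset: Readonly<OptionsSketch>, path: string): OptionsSketch {
  return { ...preset, path }
}

const first = withPath(FullPageJpgPreset, "a.jpg")
const second = withPath(FullPageJpgPreset, "b.jpg")

// the copies differ, the shared preset stays untouched
console.log(first.path, second.path, FullPageJpgPreset.path) // → a.jpg b.jpg undefined
```

The spread creates a shallow copy, which is enough here because the preset only holds primitive values.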

Making sure the destination folder is actually present is something we can automate! Let's add the following `static` method to our `ScreenshotProcessor`:

```ts
import sanitize from "sanitize-filename"
import tmp from "tmp"
import fs from "node:fs/promises"
import path from "node:path"
import { ScreenshotOptions } from "puppeteer"

public static async run(
  urls: Record<string, string>,
  options: ProcessorOptions
) {
  // start the processor
  let processor = new ScreenshotProcessor(options.maxConcurrency || 8)
  await processor.start()

  // ensure folder path
  let tmpFolder =
    options.folder || tmp.dirSync({ postfix: "screenshots" }).name
  tmpFolder = path.resolve(tmpFolder)
  await fs.mkdir(tmpFolder, { recursive: true })

  for (const [name, url] of Object.entries(urls)) {
    let screenshotOptions = Object.assign(
      {},
      options.screenshotOptions || FullPageJpgScreenshots
    ) as ScreenshotOptions

    // Puppeteer defaults to PNG when no type is set
    let tempFilePath = path.join(
      tmpFolder,
      `${sanitize(name)}.${screenshotOptions.type || "png"}`
    )
    screenshotOptions.path = tempFilePath

    let task: ScreenshotTask = {
      name,
      url: url,
      onBeforeScreenshot: options.onBeforeScreenshot,
      width: options.width,
      height: options.height,
      screenshotOptions: screenshotOptions,
      onProcessed: async () => {
        if (options.onProcessedItem) {
          await options.onProcessedItem({
            name: name,
            url: url,
            path: tempFilePath,
          })
        }
      },
    }

    // await, so every task is queued before stop() waits for the cluster
    await processor.queue(task)
  }

  await processor.stop()
}
```

This method takes on the burden of starting and stopping the processor, as well as creating a temporary folder. Our example code does the same as before, but is now way smaller:

```ts
;(async () => {
  await ScreenshotProcessor.run(URLS, {
    onProcessedItem: async task => {
      console.log(`${task.name} is now saved at: ${task.path}`)
    },
  })
  console.log("Finished")
})()
```

## So what about concurrency?

I used [hyperfine](https://github.com/sharkdp/hyperfine) to benchmark the screenshotting of 18 files and got these results for JPG files:

And these for PNG files:

From these stats I cannot draw any conclusions. Hyperfine tells me we have some statistical outliers, but my gut says that Cloudflare is blocking my access. To be continued...
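
To look at outliers from inside Node.js, a rough timing helper could complement hyperfine. This is only a sketch: `benchmark` and `fakeJob` are hypothetical names, and the stand-in workload is a timer, not a real screenshot run.

```typescript
import { performance } from "node:perf_hooks"

type Stats = { runs: number; mean: number; min: number; max: number }

// Hypothetical helper: runs `job` sequentially and summarizes wall-clock time in ms.
async function benchmark(job: () => Promise<void>, runs = 5): Promise<Stats> {
  const times: number[] = []
  for (let i = 0; i < runs; i++) {
    const start = performance.now()
    await job()
    times.push(performance.now() - start)
  }
  const mean = times.reduce((sum, t) => sum + t, 0) / times.length
  return { runs, mean, min: Math.min(...times), max: Math.max(...times) }
}

// Stand-in workload. In practice this could be something like
// () => ScreenshotProcessor.run(URLS, { maxConcurrency: 4 })
const fakeJob = () => new Promise<void>(resolve => setTimeout(resolve, 20))

benchmark(fakeJob, 3).then(stats => console.log(stats))
```

A large gap between `min` and `max` for the same `maxConcurrency` would hint at an external factor, such as rate limiting, rather than at the concurrency setting itself.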
