There are plenty of online services that take screenshots of a site and save them in a folder. Paying for such a system can be very useful, but it is not that hard to build one yourself. Let's use Chrome / Chromium with Puppeteer and Node.js to take some snapshots in no time. We'll use the Puppeteer Cluster package to run multiple workers that grab those screens in parallel. We'll be using TypeScript.
Packages
Let's install the following packages:
npm install puppeteer
npm install puppeteer-cluster
npm install sanitize-filename
npm install tmp
npm install -D @types/tmp
Define the inputs
The Puppeteer Cluster API has many options and building on it is super easy. But I want something even simpler and more reusable: let's take a bunch of URLs and save them in a directory as fast as possible.
Let's define a list of URLs we would like to screenshot:
const URLS: Record<string, string> = {
Homepage: "https://www.wehkamp.nl/",
Dames: "https://www.wehkamp.nl/damesmode/C21/",
Heren: "https://www.wehkamp.nl/herenmode/C22/",
Kinderen: "https://www.wehkamp.nl/kinderen/C23/",
Baby: "https://www.wehkamp.nl/baby/C50/",
Beauty: "https://www.wehkamp.nl/mooi-gezond/C29/",
"Wonen & slapen": "https://www.wehkamp.nl/wonen-slapen/C28/",
"Koken & tafelen": "https://www.wehkamp.nl/koken-tafelen/C51/",
Tuin: "https://www.wehkamp.nl/tuin-klussen/C30/",
Huishouden: "https://www.wehkamp.nl/huishouden/C27/",
Elektronica: "https://www.wehkamp.nl/elektronica/C26/",
Speelgoed: "https://www.wehkamp.nl/speelgoed-games/C25/",
"Sport & vrije tijd": "https://www.wehkamp.nl/sport-vrije-tijd/C24/",
"Boeken, films & muziek": "https://www.wehkamp.nl/boeken-films-muziek/C31/",
Cadeaus: "https://www.wehkamp.nl/cadeaushop/C52/",
Sale: "https://www.wehkamp.nl/sale/OPR/",
Actie: "https://www.wehkamp.nl/actie/C92/",
Kleding: "https://www.wehkamp.nl/kleding/C80/",
}
Next, let's define the type we need to process a single URL and call it ScreenshotTask:
import { Page, ScreenshotOptions } from "puppeteer"
export type ScreenshotTask = {
name?: string
url: string
width?: number
height?: number
screenshotOptions: ScreenshotOptions
onProcessed?: (settings: ScreenshotTask) => Promise<void>
onBeforeScreenshot?: (settings: ScreenshotTask, page: Page) => Promise<void>
}
We've borrowed the ScreenshotOptions from the Puppeteer project. We'll use the width and height to change the view-port (which can be used for mobile screenshots). If you set the screenshotOptions.path property, Puppeteer will write the screenshot to that path.
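As a rough illustration of these fields, a task for a phone-sized screenshot might look like the sketch below. ScreenshotOptionsLike is a stand-in for Puppeteer's ScreenshotOptions so that the snippet runs without installing anything. The 375x667 size and the output path are assumptions picked for the example.

```typescript
// Stand-in for Puppeteer's ScreenshotOptions (assumption: only the fields used here).
type ScreenshotOptionsLike = {
  path?: string
  type?: "png" | "jpeg"
  fullPage?: boolean
}

// Trimmed-down version of the ScreenshotTask type (callbacks omitted).
type MobileTask = {
  name?: string
  url: string
  width?: number
  height?: number
  screenshotOptions: ScreenshotOptionsLike
}

// A phone-sized view-port; 375x667 is just an example size.
const mobileHomepage: MobileTask = {
  name: "Homepage (mobile)",
  url: "https://www.wehkamp.nl/",
  width: 375,
  height: 667,
  screenshotOptions: {
    fullPage: true,
    type: "png",
    // Puppeteer writes the image here; the folder has to exist already.
    path: "screenshots/homepage-mobile.png",
  },
}

console.log(
  `${mobileHomepage.name}: ${mobileHomepage.width}x${mobileHomepage.height}`
)
```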
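Now let's define similar settings for processing all the URLs at once:
// Shape handed to onProcessedItem (matches the object built in the run method)
export type ProcessedScreenshot = {
name: string
url: string
path: string
}
export type ProcessorOptions = {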
onProcessedItem?: (task: ProcessedScreenshot) => Promise<void>
onBeforeScreenshot?: (settings: ScreenshotTask, page: Page) => Promise<void>
screenshotOptions?: ScreenshotOptions
maxConcurrency?: number
width?: number
height?: number
folder?: string
}
Now that we have our inputs defined, we can create a class that uses the inputs to create a single screenshot.
Taking a single screenshot
We want to take multiple screenshots in parallel, so we should be able to queue a single screenshot. This class will control the cluster flow (starting, stopping and waiting for everything to finish). We're trying to create an interaction that is simple for the end user, so I would like to abstract things away:
import { Cluster } from "puppeteer-cluster"
export class ScreenshotProcessor {
private cluster: Cluster<ScreenshotTask, void>
constructor(private maxConcurrency: number) {}
public async start() {
if (!this.cluster) {
this.cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: this.maxConcurrency,
})
}
}
public async stop() {
if (this.cluster) {
await this.cluster.idle()
await this.cluster.close()
this.cluster = null
}
}
public async queue(task: ScreenshotTask) {
await this.start()
await this.cluster.queue(task, async task => {
if (task.data.height || task.data.width) {
await task.page.setViewport({
width: task.data.width || 800,
height: task.data.height || 600,
})
}
await task.page.goto(task.data.url)
if (task.data.onBeforeScreenshot) {
await task.data.onBeforeScreenshot(task.data, task.page);
}
try {
await task.page.screenshot(task.data.screenshotOptions)
} catch (ex) {
console.error(
`Error while taking screenshot\nData: ${JSON.stringify(
task.data
)}\n${ex}`
)
throw ex
}
if (task.data.onProcessed) {
await task.data.onProcessed(task.data)
}
})
}
}
The steps to take the screenshot are:
- Set the view-port if a width or height was given.
- Visit the URL.
- Run the optional onBeforeScreenshot callback.
- Take a screenshot.
- Execute the onProcessed callback if everything went OK.
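The order of these steps can be checked without launching a browser. The sketch below replays them, including the optional onBeforeScreenshot hook from the class, against a fake page object that only records which method was called. PageLike, StepTask and takeScreenshot are hypothetical names made up for this example; they are not part of Puppeteer or puppeteer-cluster.

```typescript
// Minimal slice of the page API used by the steps (hypothetical interface).
interface PageLike {
  setViewport(size: { width: number; height: number }): Promise<void>
  goto(url: string): Promise<void>
  screenshot(options: { path?: string }): Promise<void>
}

type StepTask = {
  url: string
  width?: number
  height?: number
  screenshotOptions: { path?: string }
  onBeforeScreenshot?: (task: StepTask, page: PageLike) => Promise<void>
  onProcessed?: (task: StepTask) => Promise<void>
}

// Same order as in the queue callback: view-port, visit, hook, screenshot, callback.
async function takeScreenshot(page: PageLike, task: StepTask): Promise<void> {
  if (task.width || task.height) {
    await page.setViewport({
      width: task.width || 800,
      height: task.height || 600,
    })
  }
  await page.goto(task.url)
  if (task.onBeforeScreenshot) {
    await task.onBeforeScreenshot(task, page)
  }
  await page.screenshot(task.screenshotOptions)
  if (task.onProcessed) {
    await task.onProcessed(task)
  }
}

// Fake page that records the calls instead of driving a browser.
const calls: string[] = []
const fakePage: PageLike = {
  setViewport: async size => {
    calls.push(`viewport ${size.width}x${size.height}`)
  },
  goto: async url => {
    calls.push(`goto ${url}`)
  },
  screenshot: async options => {
    calls.push(`screenshot ${options.path}`)
  },
}

const done = takeScreenshot(fakePage, {
  url: "https://www.wehkamp.nl/",
  width: 1280, // height is left out, so it falls back to 600
  screenshotOptions: { path: "homepage.png" },
  onBeforeScreenshot: async () => {
    calls.push("before")
  },
  onProcessed: async () => {
    calls.push("processed")
  },
}).then(() => console.log(calls.join("\n")))
```

Keeping the steps in a plain function like this is one way to test the flow in isolation; the class above inlines them in the queue callback instead.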
We could now take multiple screenshots by executing:
;(async () => {
let processor = new ScreenshotProcessor(8)
await processor.start()
// queue() resolves as soon as a job is queued, so wait for all of them before stopping
await Promise.all(
Object.entries(URLS).map(([name, url]) =>
processor.queue({
name,
url,
screenshotOptions: {
fullPage: true,
type: "png",
path: `c:\\temp\\test\\${name}.png`,
},
onProcessed: async task =>
console.log(
`${task.name} is now saved at: ${task.screenshotOptions.path}`
),
})
)
)
await processor.stop()
console.log("Finished")
})()
Provided the folder of each path is present, this will work fine.
Let's make it even easier
The screenshot settings will not change that often, so let's export them as constants:
export const FullPagePngScreenshots = Object.freeze(<ScreenshotOptions>{
fullPage: true,
type: "png",
})
export const FullPageJpgScreenshots = Object.freeze(<ScreenshotOptions>{
fullPage: true,
type: "jpeg",
quality: 80,
})
This way, newcomers to our code don't have to read the Puppeteer docs before they get started.
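Because the presets are frozen, a caller who wants a variation should copy one rather than change it in place. A minimal sketch, using a local stand-in type instead of Puppeteer's ScreenshotOptions so that it runs on its own; the lower quality value is just an example:

```typescript
// Stand-in for Puppeteer's ScreenshotOptions (assumption: only the fields used here).
type ScreenshotOptionsLike = {
  fullPage?: boolean
  type?: "png" | "jpeg"
  quality?: number
  path?: string
}

const jpgPreset: ScreenshotOptionsLike = {
  fullPage: true,
  type: "jpeg",
  quality: 80,
}
const FullPageJpgScreenshots = Object.freeze(jpgPreset)

// Spread the preset into a new object and override what you need.
const smallerJpgs: ScreenshotOptionsLike = {
  ...FullPageJpgScreenshots,
  quality: 60,
}

console.log(Object.isFrozen(FullPageJpgScreenshots)) // true
console.log(smallerJpgs.quality, FullPageJpgScreenshots.quality) // 60 80
```

The run method further down does the same kind of copy with Object.assign before it sets the path on the options.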
Making sure the destination folder is actually present is something we can automate! Let's add the following static method to our ScreenshotProcessor:
import sanitize from "sanitize-filename"
import tmp from "tmp"
import fs from "node:fs/promises"
import path from "node:path"
import { ScreenshotOptions } from "puppeteer"
public static async run(
urls: Record<string, string>,
options: ProcessorOptions
) {
// start the processor
let processor = new ScreenshotProcessor(options.maxConcurrency || 8)
await processor.start()
// ensure folder path
let tmpFolder =
options.folder || tmp.dirSync({ postfix: "screenshots" }).name
tmpFolder = path.resolve(tmpFolder)
await fs.mkdir(tmpFolder, { recursive: true })
for (const [name, url] of Object.entries(urls)) {
// copy the options, so the (frozen) presets are never modified
let screenshotOptions = Object.assign(
{},
options.screenshotOptions || FullPageJpgScreenshots
) as ScreenshotOptions
// Puppeteer defaults to png when no type is given
let tempFilePath = path.join(
tmpFolder,
`${sanitize(name)}.${screenshotOptions.type || "png"}`
)
screenshotOptions.path = tempFilePath
let task: ScreenshotTask = {
name,
url: url,
onBeforeScreenshot: options.onBeforeScreenshot,
width: options.width,
height: options.height,
screenshotOptions: screenshotOptions,
onProcessed: async () => {
if (options.onProcessedItem) {
await options.onProcessedItem({
name: name,
url: url,
path: tempFilePath,
})
}
},
}
// queue() resolves once the job is queued, not when the screenshot is done
await processor.queue(task)
}
await processor.stop()
}
This method takes over the burden of starting and stopping the processor and of creating a temporary folder. Our example code does the same as before, but is now way smaller:
;(async () => {
await ScreenshotProcessor.run(URLS, {
onProcessedItem: async task => {
console.log(`${task.name} is now saved at: ${task.path}`)
},
})
console.log("Finished")
})()
So what about concurrency?
When I used hyperfine to benchmark the screenshotting of the 18 URLs, I got these results for JPG files:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
node dist/index.js jpg 1 | 5.461 ± 3.985 | 2.957 | 12.848 | 1.75 ± 1.27 |
node dist/index.js jpg 2 | 3.129 ± 0.084 | 3.016 | 3.297 | 1.00 |
node dist/index.js jpg 4 | 3.924 ± 2.549 | 2.990 | 11.175 | 1.25 ± 0.82 |
node dist/index.js jpg 8 | 5.149 ± 3.310 | 3.041 | 11.043 | 1.65 ± 1.06 |
node dist/index.js jpg 10 | 3.284 ± 0.114 | 3.080 | 3.446 | 1.05 ± 0.05 |
node dist/index.js jpg 12 | 5.795 ± 3.274 | 3.112 | 12.038 | 1.85 ± 1.05 |
node dist/index.js jpg 14 | 3.195 ± 0.108 | 3.016 | 3.390 | 1.02 ± 0.04 |
node dist/index.js jpg 16 | 4.425 ± 2.659 | 3.071 | 9.891 | 1.41 ± 0.85 |
And these for PNG files:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
node dist/index.js png 1 | 3.155 ± 0.082 | 3.004 | 3.268 | 1.00 |
node dist/index.js png 2 | 6.356 ± 5.979 | 3.070 | 18.098 | 2.01 ± 1.90 |
node dist/index.js png 4 | 4.360 ± 2.474 | 3.452 | 11.396 | 1.38 ± 0.78 |
node dist/index.js png 8 | 6.090 ± 4.880 | 3.045 | 16.610 | 1.93 ± 1.55 |
node dist/index.js png 10 | 3.219 ± 0.142 | 3.062 | 3.535 | 1.02 ± 0.05 |
node dist/index.js png 12 | 4.990 ± 3.865 | 3.067 | 12.555 | 1.58 ± 1.23 |
node dist/index.js png 14 | 7.335 ± 9.091 | 3.039 | 31.930 | 2.32 ± 2.88 |
node dist/index.js png 16 | 7.195 ± 5.217 | 3.078 | 17.075 | 2.28 ± 1.65 |
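The benchmark commands pass an image type and a concurrency level to dist/index.js. That entry point is not shown here, so the following is only a guess at how the two arguments could be read; parseBenchmarkArgs and its defaults are hypothetical.

```typescript
type BenchmarkArgs = {
  type: "png" | "jpeg"
  maxConcurrency: number
}

// Reads "<jpg|png> <concurrency>" from the arguments that follow the script name.
function parseBenchmarkArgs(args: string[]): BenchmarkArgs {
  const [format = "jpg", concurrency = "8"] = args
  const parsed = Number.parseInt(concurrency, 10)
  return {
    // Puppeteer calls the format "jpeg", the command line uses "jpg".
    type: format === "png" ? "png" : "jpeg",
    // Fall back to 8 workers (the default used in run) for invalid input.
    maxConcurrency: Number.isInteger(parsed) && parsed > 0 ? parsed : 8,
  }
}

// In a real entry point the input would be process.argv.slice(2).
console.log(parseBenchmarkArgs(["jpg", "4"]))
```

The result could then be handed to ScreenshotProcessor.run, picking FullPagePngScreenshots or FullPageJpgScreenshots based on the type.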
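I cannot draw any conclusions from these stats. The fastest run takes roughly 3 to 3.5 seconds at every concurrency level, while the means are dominated by a few very slow runs. Hyperfine tells me we have some statistical outliers, but my gut says that Cloudflare is blocking my access. To be continued...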