Jupyter Notebooks + VSCode Dev Container with Puppeteer support

I have a love-hate relationship with Python: it is super easy to develop in (abundance of examples and packages, cross-platform), but setting up your development environment (Python versions, package versions) is cumbersome. It gets worse when you need to share that setup with another developer on another platform: I work on Windows, my colleagues on Mac and Linux.

That's why I now use Visual Studio Code and "dev containers":

The Visual Studio Code Remote - Containers extension lets you use a Docker container as a full-featured development environment. It allows you to open any folder inside (or mounted into) a container and take advantage of Visual Studio Code's full feature set.

Source: Developing inside a Container

I have some reasons for switching to this setup:

  • Visual Studio Code is already my go-to tool for many languages (Node.js, Bash, Terraform). Many of my colleagues also use it for various tasks.
  • Running the whole dev setup in Docker helps keep the host system clean. It would not be the first time someone "bricked" their Python setup due to a new version requirement.
  • Docker gives you a nice, replicable, cross-platform setup: it should work the same on every dev machine.

Now, the main question is: will this work for Jupyter Notebooks? Visual Studio Code already provides an excellent UI, but will it run Jupyter in a development container? Let's find out! My use case is a scraper, so I need support for Puppeteer as well.

  1. Intro
  2. Default Python Dev Container Setup
  3. Install IPython & Pandas in the Dev Container
    1. Do I still need to install packages in my notebook?
  4. How about Puppeteer?
  5. Don't commit Jupyter Notebook output
    1. Auto save?
  6. Improving performance
  7. VSCode extensions
  8. Final thoughts
  9. Changelog

Default Python Dev Container Setup

Let's start with a Python 3 development container:

  1. Open the Command Palette (F1, or Ctrl/Cmd+Shift+P)
  2. Search for: Remote-Containers: Add Development Container Configuration files... This will open up a small wizard.
  3. Select Python 3 as the language.
  4. Select 3 as the version (it will add the latest anyway).
  5. We don't need Node.js, so let's select None (can be enabled later, but this option will make your container build faster).

The setup generates the configuration files in the .devcontainer folder of your project.
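The generated devcontainer.json looks roughly like this (abridged; the exact contents depend on the template version the wizard uses):

```json
{
  "name": "Python 3",
  "build": {
    "dockerfile": "Dockerfile",
    "args": {
      "VARIANT": "3.10-bullseye",
      "NODE_VERSION": "none"
    }
  },
  "remoteUser": "vscode"
}
```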

Install IPython & Pandas in the Dev Container

Now, open the Dockerfile from the .devcontainer folder and uncomment the following lines:

# [Optional] If your pip requirements rarely change, uncomment this section to add them to the image.
COPY requirements.txt /tmp/pip-tmp/
RUN pip3 --disable-pip-version-check --no-cache-dir install -r /tmp/pip-tmp/requirements.txt \
   && rm -rf /tmp/pip-tmp

This makes it possible to start using a requirements.txt file, to install packages with pip at the container level.

Next, add the requirements.txt to the root of your project and enter the following lines:

ipython
ipykernel
pandas

This makes sure the container installs the packages you need for your notebook.

Do I still need to install packages in my notebook?

You don't have to, but it might make things easier if you do, and it caters to people who don't use your setup. With this single line, you install all the packages specified in your requirements.txt file:

%pip install --quiet --exists-action i --disable-pip-version-check -r ../requirements.txt --user

It completes quickly, as everything is already installed in your dev container. This also makes adding new packages easier, as you don't have to rebuild your dev container right away.

How about Puppeteer?

We're going to use Pyppeteer, the Python port of Puppeteer. Running Puppeteer from a container is not straightforward, so let's airlift the code from this article into our setup.

Add the following lines to the Dockerfile:

# Install Google Chrome Stable and fonts
# Note: this installs the necessary libs to make the browser work with Puppeteer.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true
RUN apt-get update && export DEBIAN_FRONTEND=noninteractive && apt-get install gnupg wget -y && \
  wget --quiet --output-document=- https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor > /etc/apt/trusted.gpg.d/google-archive.gpg && \
  sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' && \
  apt-get update && \
  apt-get install google-chrome-stable -y --no-install-recommends && \
  rm -rf /var/lib/apt/lists/*

This downloads and installs Chrome into the container. Now we just need to add the Puppeteer packages to our requirements.txt file:

ipython
ipykernel
pandas
pyppeteer

Add the following line to your notebook:

%pip install --quiet --exists-action i --disable-pip-version-check pyppeteer

Now, when we launch the browser in the notebook, we only have to check whether we're in our development container:

browser_options = {
    'headless': True,
    'args': ["--no-sandbox"]
}

if os.getenv('PUPPETEER_SKIP_CHROMIUM_DOWNLOAD', '') != '':
    browser_options["executablePath"] = '/usr/bin/google-chrome'

browser = await launch(browser_options)

This makes sure that the pre-installed version of Chrome is used in your container.
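If you want to keep the notebook cell tidy, the check can be wrapped in a small helper (a sketch; the function name is mine, not part of Pyppeteer):

```python
import os

def build_browser_options() -> dict:
    """Build Pyppeteer launch options, preferring the Chrome binary
    the dev container pre-installs over a downloaded Chromium."""
    # --no-sandbox is needed because Chrome's sandbox does not work
    # inside the default container setup.
    options = {'headless': True, 'args': ['--no-sandbox']}
    # The Dockerfile sets this variable, so its presence tells us we
    # are inside the dev container, with Chrome at a known path.
    if os.getenv('PUPPETEER_SKIP_CHROMIUM_DOWNLOAD', '') != '':
        options['executablePath'] = '/usr/bin/google-chrome'
    return options
```

The browser launch then becomes `browser = await launch(build_browser_options())`.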

Don't commit Jupyter Notebook output

We want to improve what we commit to Git: let's not commit the notebook's output, by implementing the approach from this article. First, add the nbconvert package to your requirements.txt:

ipython
ipykernel
nbconvert
pandas
pyppeteer
pyppeteer_stealth

Let's configure Git to use nbconvert as a clean filter; add a .gitconfig file to the root of your project:

[filter "strip-notebook-output"]
    clean = "jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR"

Now add a file called .gitattributes:

*.ipynb filter=strip-notebook-output

The last file you'll need to add is repo_init.sh:

#!/usr/bin/env bash

git config --local include.path ../.gitconfig
git add --renormalize .

Now, start your project in the dev container and hook things up with bash repo_init.sh. Each user needs to run this script once after cloning the repository.

Now Git understands what has actually changed, and will no longer show a change when you have merely run the notebook. Note: VSCode will still think the notebook has changed, but Git itself will not commit the output.
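To see what the clean filter actually does, here is a simplified sketch in Python of the transformation nbconvert performs (the real filter handles more metadata; strip_output is a hypothetical helper name, not part of nbconvert):

```python
import json

def strip_output(nb_json: str) -> str:
    """Simplified version of the nbconvert clean filter: clear outputs
    and execution counts from every code cell, so only source changes
    reach Git."""
    nb = json.loads(nb_json)
    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None
    return json.dumps(nb)
```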

Auto save?

When you run a notebook, the file changes: notebooks store both the code and the output. I find it very annoying that VSCode shows an unsaved file in my IDE (and tries to restore it if I close the editor). To mitigate this, you can enable auto save in your dev container settings:

  1. Open up .devcontainer/devcontainer.json
  2. Navigate to customizations > vscode > settings
  3. Add: "files.autoSave": "afterDelay",
  4. Add "files.autoSaveDelay": 1000

Pet peeve fixed.
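Assembled, the relevant part of .devcontainer/devcontainer.json then looks roughly like this (other settings generated by the wizard may already be present alongside these):

```json
{
  "customizations": {
    "vscode": {
      "settings": {
        "files.autoSave": "afterDelay",
        "files.autoSaveDelay": 1000
      }
    }
  }
}
```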

Improving performance

Our current setup is naïve, as it does not fully leverage Docker's layer caching. Let's look at our final Dockerfile:

# See here for image contents: https://github.com/microsoft/vscode-dev-containers/tree/v0.245.0/containers/python-3/.devcontainer/base.Dockerfile

# [Choice] Python version (use -bullseye variants on local arm64/Apple Silicon): 3, 3.10, 3.9, 3.8, 3.7, 3.6, 3-bullseye, 3.10-bullseye, 3.9-bullseye, 3.8-bullseye, 3.7-bullseye, 3.6-bullseye, 3-buster, 3.10-buster, 3.9-buster, 3.8-buster, 3.7-buster, 3.6-buster
ARG VARIANT="3.10-bullseye"
FROM mcr.microsoft.com/vscode/devcontainers/python:0-${VARIANT}

# [Choice] Node.js version: none, lts/*, 16, 14, 12, 10
ARG NODE_VERSION="none"
RUN if [ "${NODE_VERSION}" != "none" ]; then su vscode -c "umask 0002 && . /usr/local/share/nvm/nvm.sh && nvm install ${NODE_VERSION} 2>&1"; fi

# Install Google Chrome Stable and fonts
# Note: this installs the necessary libs to make the browser work with Puppeteer.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD true
RUN apt-get update && apt-get install gnupg wget -y && \
  wget --quiet --output-document=- https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor > /etc/apt/trusted.gpg.d/google-archive.gpg && \
  sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' && \
  apt-get update && \
  apt-get install google-chrome-stable -y --no-install-recommends && \
  rm -rf /var/lib/apt/lists/*

# [Optional] If your pip requirements rarely change, uncomment this section to add them to the image.
COPY requirements.txt /tmp/pip-tmp/
RUN pip3 --disable-pip-version-check --no-cache-dir install -r /tmp/pip-tmp/requirements.txt \
   && rm -rf /tmp/pip-tmp

By moving the package installation below the Chrome installation, you don't need to reinstall Chrome every time your Python packages change.

VSCode extensions

The nice thing about this setup is the ability to share your Visual Studio Code extensions: they are stored in devcontainer.json, just like the settings.

I replaced the customizations > vscode > extensions node with this:

      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "ms-toolsai.jupyter",
        "ms-toolsai.jupyter-keymap",
        "ms-toolsai.jupyter-renderers",
        "vscode-icons-team.vscode-icons",
        "wayou.vscode-todo-highlight",
        "timonwong.shellcheck"
      ]

When somebody opens the project, they are notified about the recommended extensions.

Final thoughts

I like the fact that this setup works on any machine. Consider pinning your versions, as both Chrome and your packages are currently unpinned (so you might pull in breaking changes).

I don't like the fact that Visual Studio Code thinks a notebook has changed while Git knows it hasn't. According to issue #9514, this is something that should be fixed in the core of Visual Studio Code. So, I'm not really sure why issue #83232 (or #24883) is closed.

Changelog

  • 2022-08-28 Initial article.
  • 2022-08-29 Changed the %pip install line to use the requirements.txt file.