Regular Expression Groups in PowerShell (for .NET people)


PowerShell is built on .NET, so it is no surprise that it is very popular with .NET developers. But it is a scripting language with its own rules, so you might run into some unexpected situations. I had this experience when I tried to parse some HTML with PowerShell: I could not get the replacement with regular expression groups to work! It turned out that my .NET knowledge was working against me...

TL;DR

When creating a regular expression or a replacement string, use single-quoted strings and you'll avoid a world of pain! Also make sure you use the proper regular expression options.

Let's create some data

Anything is better with an example, so let's use PowerShell to download a blog article and extract its content using a regular expression. First, we'll download the page into a string, like this:

$ErrorActionPreference = "Stop"

# download article
$url = "https://keestalkstech.com/2020/05/plotting-a-grid-of-pil-images-in-jupyter/"
$article = Invoke-WebRequest $url -UseBasicParsing

# simple convert to string :-)
$article = "$article"

Fail on the first try

My first attempt was the following code:

$article = $article -replace ".*<article.*?>\s*(.*)\s*<\/article>.*", "$1"

It runs without errors. It looks good to me as a .NET developer... but... it does not do anything! I end up with exactly the same string I had...

Regular expression options: (?s)

First, we need to understand how regular expression matching works in PowerShell: by default, . does not match newlines. The downloaded HTML spans many lines, so the pattern never matches and -replace returns the input unchanged. To switch to single-line mode, add (?s) to the start of your expression, like this:

$article = $article -replace "(?s).*<article.*?>\s*(.*)\s*<\/article>.*", "$1"
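To see the effect of (?s) in isolation, here is a minimal sketch on a made-up two-line string. The $sample value and the <b> tag are only for illustration; they are not taken from the downloaded page.

```powershell
# Hypothetical two-line input; `n is a newline inside a double-quoted string.
$sample = "<b>first`nsecond</b>"

# Without (?s) the dot stops at the newline, so the pattern does not match...
$withoutS = $sample -match '<b>.*</b>'                        # False
# ...and -replace hands back the input untouched.
$unchanged = ($sample -replace '<b>.*</b>', '') -eq $sample   # True

# With (?s) the dot also matches the newline.
$withS = $sample -match '(?s)<b>.*</b>'                       # True

# The same option the .NET way: pass RegexOptions.Singleline.
$options = [System.Text.RegularExpressions.RegexOptions]::Singleline
$viaDotNet = [regex]::IsMatch($sample, '<b>.*</b>', $options) # True
```

The inline (?s) and RegexOptions.Singleline should behave the same; the inline form is simply easier to combine with the -replace operator.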

Again: it runs. It looks good to me... but now I end up with an empty string!

Quotation matters!

The main problem has to do with quotation. To me as a .NET developer, double quotes delimit a string ("hello") and single quotes a char ('c'). But in PowerShell, a double-quoted string supports variable expansion: "hello $name". PowerShell expands $1 before the regular expression engine ever sees it, and since we do not have a variable named $1, the replacement string ends up empty. That's why our article is replaced by an empty string.

The following is more PowerShell-esque and actually works:

$article = $article -replace '(?s).*<article.*?>\s*(.*)\s*<\/article>.*', '$1'
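The difference between the two kinds of quotes is easy to check on a tiny input. This is a sketch that assumes a session without a variable named $1 and with strict mode off; the 'abc' input is made up for the example.

```powershell
# What the regex engine receives as replacement text in each case.
$double = "[$1]"   # PowerShell expands the unset variable $1 first: []
$single = '[$1]'   # passed through literally: [$1]

$withDouble = 'abc' -replace '(b)', $double   # a[]c
$withSingle = 'abc' -replace '(b)', $single   # a[b]c
```

With double quotes the group reference is gone before -replace runs; with single quotes it reaches the regex engine intact.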

But I love my double quotes...

If you are adamant about using double quotes, you must escape your dollar signs with a backtick (`):

$article = $article -replace "(?s).*<article.*?>\s*(.*)\s*<\/article>.*", "`$1"
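Double quotes do have one legitimate use here: mixing a real PowerShell variable with a group reference in the same replacement string. A small sketch, where $tag and the input text are hypothetical:

```powershell
$tag = 'em'   # hypothetical variable, expanded by PowerShell

# $tag is expanded by PowerShell; the escaped `$1 is left for the regex engine.
$result = 'hello world' -replace '(world)', "<$tag>`$1</$tag>"
# hello <em>world</em>
```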

What about named capture replacement?

Sometimes named capture groups improve the readability of your code. As they also use the dollar sign, they "suffer" from the same problem, so use single quotes:

$article = $article -replace '(?s).*<article.*?>\s*(?<content>.*)\s*<\/article>.*', '${content}'

Did you know that ${1} also works?
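Both forms can be tried on a small input. The date string, the group names and the price text below are made up for this sketch:

```powershell
# Named groups: reorder the parts of a date.
$date = '2020-06-06'
$pattern = '(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})'
$reordered = $date -replace $pattern, '${day}/${month}/${year}'   # 06/06/2020

# Numbered group in braces: handy when a digit follows the reference.
# '$10' would be read as a reference to group 10; the braces end the number after the 1.
$padded = 'price: 5' -replace '(\d)', '${1}0'   # price: 50
```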

Final thoughts

Don't assume PowerShell and .NET are the same! Scripting needs differ from application programming needs. To be on the safe side: use single quotes and your regular expression groups will work fine in PowerShell.

Funnily enough it was not the first time I had problems with regular expressions that looked similar to .NET; read this article about regular expression groups in JavaScript.

Improvements

2020-06-06: rewrote the article to reflect the problems with quotation and new-line matching.
