Regular Expression Groups in PowerShell (for .NET people)

Aerial photo of a dam.

PowerShell is built on .NET, so it is no surprise that it is very popular with .NET developers. But it is a scripting language, and some things behave differently from what you might expect. I ran into this when I tried to parse some HTML with PowerShell: I could not get replacement with regular expression groups to work! It turned out that my .NET knowledge was working against me...


When creating a regular expression or a replacement string, use single-quoted strings and you'll avoid a world of pain! Also make sure you use the proper regular expression options.

Let's create some data

Anything is better with an example, so let's use PowerShell to download a blog article and extract its content using a regular expression. First, we'll download the page into a string, like this:

$ErrorActionPreference = "Stop"

# download article
$url = ""
$article = Invoke-WebRequest $url -UseBasicParsing

# simple convert to string :-)
$article = "$article"
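If you want to follow along without downloading anything, a small hand-written page works just as well. The snippet below is only a sketch: the HTML is a made-up stand-in, not the real blog, and it is stored in the same $article variable so the rest of the examples can be tried on it.

```powershell
# Hypothetical stand-in for the downloaded page (not the real blog content).
# A single-quoted here-string keeps the text literal and preserves the newlines.
$article = @'
<html>
<body>
<article class="post">
<h1>Title</h1>
<p>Some text.</p>
</article>
</body>
</html>
'@

# The important property: the <article> element spans several lines.
$lineCount = ($article -split "`n").Count
```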

Fail on the first try

My first attempt was the following code:

$article = $article -replace ".*<article.*?>\s*(.*)\s*<\/article>.*", "$1"

It runs without errors. It looks good to me as a .NET developer... but... it does not do anything! I end up with exactly the same string I started with...

Regular expression options: (?s)

First, we need to understand the way regular expression matching works in PowerShell: by default (just as in .NET) . does not match newlines. To switch to single-line mode, in which . matches every character including newlines, prefix your expression with (?s), like this:

$article = $article -replace "(?s).*<article.*?>\s*(.*)\s*<\/article>.*", "$1"

Again: it runs without errors. It looks good to me... but now I end up with an empty string!
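To look at the newline behaviour on its own, it helps to leave the replacement string out and just test the match. A minimal sketch, using a made-up three-line sample and a simplified pattern (both are assumptions for illustration):

```powershell
# Hypothetical sample; "`n" is PowerShell's newline escape in double-quoted strings.
$sample = "<article>`nHello`n</article>"

# Without (?s), '.' stops at each newline, so the pattern cannot span the lines.
$withoutS = $sample -match '<article>(.*)</article>'        # False

# With (?s), '.' also matches newlines, so the pattern matches.
$withS = $sample -match '(?s)<article>(.*)</article>'       # True

# When the pattern does not match, -replace returns the input unchanged.
$unchanged = ($sample -replace '<article>(.*)</article>', 'X') -eq $sample   # True
```

That last line is why the first attempt seemed to do nothing at all: no match, no replacement.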

Quotation matters!

The main problem has to do with quotation. To me as a .NET developer, double quotes delimit a string ("hello") and single quotes a char ('c'). But to a PowerShell developer, a double-quoted string supports variable expansion: "hello $name". We do not have a variable named $1, so PowerShell expands it to an empty string before the regular expression engine ever sees it. That's why our article is replaced by an empty string.

The following is more PowerShell-esque and actually works:

$article = $article -replace '(?s).*<article.*?>\s*(.*)\s*<\/article>.*', '$1'
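The difference is easy to see on a tiny input. The sketch below uses a made-up key=value string and a simplified pattern, and assumes no variable named $1 exists and strict mode is off (the defaults in a fresh session):

```powershell
# Hypothetical sample input.
$sample = 'key=value'

# Double quotes: PowerShell expands $1 (an unset variable) to an empty string.
$doubleQuoted = "$1"
# Single quotes: the text stays literal, so the regex engine receives $1.
$singleQuoted = '$1'

# The whole input matches, and is replaced by... nothing.
$empty = $sample -replace '.*=(.*)', "$1"     # (empty string)

# Here the regex engine sees $1 and substitutes the first group.
$value = $sample -replace '.*=(.*)', '$1'     # value
```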

But I love my double quotes...

If you are adamant about using double quotes, you must escape your dollar signs with a backtick (`):

$article = $article -replace "(?s).*<article.*?>\s*(.*)\s*<\/article>.*", "`$1"
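One case where double quotes earn their keep is when the replacement should mix a real PowerShell variable with a group reference. A small sketch, where $prefix and the sample string are made up for illustration:

```powershell
# Hypothetical variable and sample input.
$prefix = 'Content: '
$sample = 'key=value'

# The escaped form is the same two characters as the single-quoted form.
$same = "`$1" -eq '$1'                                  # True

# $prefix is expanded by PowerShell; `$1 is left for the regex engine.
$result = $sample -replace '.*=(.*)', "$prefix`$1"      # Content: value
```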

What about named capture replacement?

Sometimes named capture groups improve the readability of your code. As they also use the dollar sign, they "suffer" from the same problem, so use single quotes:

$article = $article -replace '(?s).*<article.*?>\s*(?<content>.*)\s*<\/article>.*', '${content}'

Did you know that ${1} also works?
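Here is a small sketch of both forms on made-up input (the group names k and v are just for illustration). Keep in mind that inside double quotes ${content} is PowerShell's own syntax for a variable named content, so single quotes matter here too. The braces around a number come in handy when the group reference is followed directly by a digit:

```powershell
# Hypothetical sample input.
$sample = 'key=value'

# Named groups: swap the two sides of the equals sign.
$swapped = $sample -replace '(?<k>\w+)=(?<v>\w+)', '${v}=${k}'   # value=key

# Braces make clear where the group number ends: group 1 followed by a literal 0.
$braced = 'abc' -replace '(b)', '${1}0'                           # ab0c

# Without braces, $10 is read as a reference to group 10, which does not exist here.
$unbraced = 'abc' -replace '(b)', '$10'
```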

Final thoughts

Don't assume PowerShell and .NET are the same! The needs of scripting differ from the needs of application programming. To be on the safe side: use single quotes and your regular expression groups will work fine in PowerShell.

Funnily enough it was not the first time I had problems with regular expressions that looked similar to .NET; read this article about regular expression groups in JavaScript.


2020-06-06: rewrote the article to reflect the problems with quotation and new-line matching.