PowerShell is built on .NET, so it is no surprise that it is very popular with .NET developers. But it is a scripting language with its own rules, so you might run into some unexpected behavior. I had this experience when I tried to parse some HTML with PowerShell: I could not get the replacement with regular expression groups to work! It turned out that my .NET knowledge was working against me...
TL;DR
When creating a regular expression or a replacement string, use single-quoted strings and you'll avoid a world of pain! Also make sure you use the proper regular expression options.
Let's create some data
Anything is better with an example, so let's use PowerShell to download a blog and extract the article content using a regular expression. First, we'll download a blog into a string, like this:
$ErrorActionPreference = "Stop"
# download article
$url = "https://keestalkstech.com/2020/05/plotting-a-grid-of-pil-images-in-jupyter/"
$article = Invoke-WebRequest $url -UseBasicParsing
# simple convert to string :-)
$article = "$article"
Fail on the first try
My first attempt was the following code:
$article = $article -replace ".*<article.*?>\s*(.*)\s*<\/article>.*", "$1"
It runs without errors. It looks good to me as a .NET developer... but... it does not do anything! I end up with exactly the same string I had...
Regular expression options: (?s)
First, we need to understand how regular expression matching works in PowerShell: by default, . does not match newlines. To switch to single-line mode, add (?s) to the start of your expression, like this:
$article = $article -replace "(?s).*<article.*?>\s*(.*)\s*<\/article>.*", "$1"
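To see the effect of (?s) without downloading anything, here is a minimal sketch on a made-up fragment. The $html sample and the variable names are assumptions for illustration, not part of the original script:

```powershell
# Hypothetical sample: a small multi-line fragment standing in for the downloaded page.
$html = "<html>`n<article class='post'>`nHello`nWorld`n</article>`n</html>"

$pattern = '.*<article.*?>\s*(.*)\s*</article>.*'

# Default mode: . stops at each line break, so the pattern cannot span the article.
$matchesByDefault = $html -match $pattern

# Single-line mode: . also matches line breaks, so the whole article is captured.
$matchesSingleLine = $html -match ('(?s)' + $pattern)
$content = $Matches[1].Trim()

"default: $matchesByDefault, single-line: $matchesSingleLine"
```

The pattern is single-quoted here, so PowerShell hands it to the regex engine untouched.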
Again: it runs. It looks good to me... but now I end up with an empty string!
Quotation matters!
The main problem has to do with quotation. To me as a .NET developer, double quotes mean a string ("hello") and single quotes mean a char ('c'). But to a PowerShell developer, a double-quoted string supports variable expansion: "hello $name". We do not have a variable named $1, so it expands to nothing before the regular expression engine ever sees it. That's why our article is replaced by an empty string.
The following is more PowerShell-esque and actually works:
$article = $article -replace '(?s).*<article.*?>\s*(.*)\s*<\/article>.*', '$1'
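The difference is easy to reproduce on a one-line string. This is a small sketch with a made-up $sample value. It assumes $1 is not defined in your session and that strict mode is off (with Set-StrictMode enabled, the undefined variable raises an error instead):

```powershell
# Hypothetical sample input, small enough to inspect by eye.
$sample = 'before <b>content</b> after'

# Double quotes: PowerShell expands $1 first. It is undefined, so the
# replacement string that reaches the regex engine is empty.
$doubleQuoted = "$1"
$wrong = $sample -replace '.*<b>(.*)</b>.*', "$1"

# Single quotes: the text $1 reaches the regex engine verbatim and is
# treated as a reference to the first capture group.
$singleQuoted = '$1'
$right = $sample -replace '.*<b>(.*)</b>.*', '$1'

"double-quoted length: $($doubleQuoted.Length), single-quoted length: $($singleQuoted.Length)"
"wrong: '$wrong', right: '$right'"
```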
But I love my double quotes...
If you are adamant about using double quotes, you must escape your dollar signs with a backtick (`):
$article = $article -replace "(?s).*<article.*?>\s*(.*)\s*<\/article>.*", "`$1"
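Double quotes do earn their place when you want both: a PowerShell variable expanded and a group reference kept. A minimal sketch, where $suffix and the sample text are hypothetical:

```powershell
# Hypothetical variable that should be expanded by PowerShell.
$suffix = '!'

# `$2 and `$1 are escaped, so they survive as group references;
# $suffix is not escaped, so PowerShell expands it.
$replacement = "`$2 `$1$suffix"

$result = 'hello world' -replace '(\w+) (\w+)', $replacement

"replacement: $replacement"
"result: $result"
```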
What about named capture replacement?
Sometimes named capture groups improve the readability of your code. As they also use the dollar sign, they "suffer" from the same problem, so use single quotes:
$article = $article -replace '(?s).*<article.*?>\s*(?<content>.*)\s*<\/article>.*', '${content}'
Did you know that ${1} also works?
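The braces are handy when a digit follows the group reference: a replacement of '$10' would be read as a reference to group 10, while '${1}0' means group 1 followed by a literal 0. A short sketch with a made-up $price string:

```powershell
# Hypothetical sample input.
$price = 'cost: 5'

# Named group followed by a literal 0.
$named = $price -replace '(?<amount>\d+)', '${amount}0'

# Numbered group followed by a literal 0; the braces end the group number.
$numbered = $price -replace '(\d+)', '${1}0'

"named: $named"
"numbered: $numbered"
```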
Final thoughts
Don't assume PowerShell and .NET are the same! Scripting needs differ from application programming needs. To be on the safe side: use single quotes and your regular expression groups will work fine in PowerShell.
Funnily enough, it was not the first time I had problems with regular expressions that looked similar to .NET; read this article about regular expression groups in JavaScript.
Improvements
2020-06-06: rewrote the article to reflect the problems with quotation and new-line matching.