Sometimes you need to test whether two files are the same. As files get larger, comparing them takes longer, so performance matters. In this article, I’ll show how to compare two files in PowerShell using a buffered approach.

When talking about performance, it is better to measure multiple times on multiple systems. For this blog I measured only once on a single system, because I’m only interested in relative changes.

Strategies

When you want to compare files, you have some strategies:

  • Generate a checksum of both files and compare the checksums. Generating a checksum means reading both files from beginning to end, while it would be great if we could stop as soon as one bit differs. That makes checksums a slow equality check (see the sketch after this list).
  • Do a byte-by-byte comparison of both files. You can stop the comparison as soon as a byte differs, so you don’t have to read the entire file, but reading one byte at a time is slow.
  • Do a buffer comparison: read several bytes at once and compare the two arrays, which is usually much faster.
  • Start by comparing the file sizes: files with a different size cannot be the same by definition, so you can return early without reading anything.
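
For reference, the checksum approach is a quick one with the built-in Get-FileHash cmdlet; the paths below are placeholders for your own files:

# Checksum approach: both files are hashed in full,
# even when they already differ in the first byte.
$hashA = (Get-FileHash -Path 'C:\temp\file1.bin' -Algorithm SHA256).Hash
$hashB = (Get-FileHash -Path 'C:\temp\file2.bin' -Algorithm SHA256).Hash
$hashA -eq $hashB   # $true when the checksums match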

Extra performance: use LINQ

PowerShell is a scripting language with native support for comparisons, but for the buffer comparison it is way faster to use LINQ than native PowerShell. We’ll use SequenceEqual, like this: [System.Linq.Enumerable]::SequenceEqual($one, $two). On my machine this LINQ method was about 500x faster than comparing the arrays in plain PowerShell.
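
To get a feel for the difference yourself, here is a rough sketch you can run; the 512KB buffers and the element-by-element loop are just my own illustration of a “native” PowerShell compare, not code from the function below:

# Rough illustration: time an element-by-element compare against LINQ's
# SequenceEqual on two identical 512KB buffers. Numbers will vary per machine.
$one = New-Object byte[] 524288
$two = New-Object byte[] 524288

(Measure-Command {
    # "Native" PowerShell: loop over every byte
    $equal = $true
    for ($i = 0; $i -lt $one.Length; $i++) {
        if ($one[$i] -ne $two[$i]) { $equal = $false; break }
    }
}).TotalMilliseconds

(Measure-Command {
    # LINQ: one call that compares the whole array
    [void][System.Linq.Enumerable]::SequenceEqual($one, $two)
}).TotalMilliseconds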

FilesAreEqual function

Let’s create a function that takes two files, compares them and returns $true when both files are equal:

function FilesAreEqual {
    param(
        [System.IO.FileInfo] $first,
        [System.IO.FileInfo] $second,
        [uint32] $bufferSize = 524288)

    # Files with different sizes cannot be equal, so bail out early
    if ($first.Length -ne $second.Length) { return $false }

    if ($bufferSize -eq 0) { $bufferSize = 524288 }

    $fs1 = $first.OpenRead()
    $fs2 = $second.OpenRead()

    $one = New-Object byte[] $bufferSize
    $two = New-Object byte[] $bufferSize
    $equal = $true

    do {
        # Read a chunk from each file into its buffer
        $bytesRead = $fs1.Read($one, 0, $bufferSize)
        $fs2.Read($two, 0, $bufferSize) | Out-Null

        # Compare the two buffers in a single LINQ call
        if (-not [System.Linq.Enumerable]::SequenceEqual($one, $two)) {
            $equal = $false
        }

    } while ($equal -and $bytesRead -eq $bufferSize)

    $fs1.Close()
    $fs2.Close()

    return $equal
}
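
Calling the function looks like this; the paths are placeholders for two files on your machine:

# Compare two files; Get-Item gives us the FileInfo objects the function expects.
$a = Get-Item 'C:\temp\file1.bin'
$b = Get-Item 'C:\temp\file2.bin'
FilesAreEqual $a $b              # $true when the files are byte-for-byte identical
FilesAreEqual $a $b 1048576      # optionally pass a different buffer size in bytes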

Buffering works

So what happens when we plug different values into $bufferSize? This is what I got on my machine when comparing the 140MB Wars of Liberty bin file with itself:

Buffer in bytes    Execution time in seconds   Ratio
2.097.152          2,4809                      1,0210
1.048.576          2,4400                      1,0042
524.288            2,4299                      1,0000
262.144            2,4357                      1,0024
131.072            2,4960                      1,0272
65.536             2,5302                      1,0413
32.768             2,6458                      1,0888
16.384             2,9035                      1,1949
8.192              4,3234                      1,7793
4.096              5,4670                      2,2499
2.048              7,6858                      3,1630
1.024              12,0329                     4,9520

A buffer that is large enough makes the file compare way faster; that’s why I settled on 524.288 bytes as the default for this PowerShell function. Comparing the entire 2.2GB setup file of Wars of Liberty took 42 seconds.
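
If you want to reproduce these timings on your own machine, a simple Measure-Command loop like the sketch below will do; the test file path is a placeholder, and comparing a file with itself forces a full read, just like in the table above:

# Time FilesAreEqual for a range of buffer sizes against one large test file.
$file = Get-Item 'C:\temp\large-file.bin'

foreach ($size in 2097152, 1048576, 524288, 262144, 131072, 65536) {
    $seconds = (Measure-Command { FilesAreEqual $file $file $size }).TotalSeconds
    "{0,9} bytes: {1:N4} seconds" -f $size, $seconds
}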

So what about a byte-by-byte comparison? That one took 32 seconds on the 140MB file, which is more than 13x slower than using a buffer of 524.288 bytes.
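
For comparison, a byte-by-byte version looks roughly like the sketch below; this is my reconstruction of the idea, not the exact code from the original article:

# Byte-by-byte variant: one ReadByte() call per byte is what makes this so slow.
function FilesAreEqualByteByByte {
    param(
        [System.IO.FileInfo] $first,
        [System.IO.FileInfo] $second)

    if ($first.Length -ne $second.Length) { return $false }

    $fs1 = $first.OpenRead()
    $fs2 = $second.OpenRead()
    $equal = $true

    do {
        $b1 = $fs1.ReadByte()   # returns -1 at the end of the file
        $b2 = $fs2.ReadByte()
        if ($b1 -ne $b2) { $equal = $false }
    } while ($equal -and $b1 -ne -1)

    $fs1.Close()
    $fs2.Close()

    return $equal
}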


Improvements

2020-06-08: the original article used a byte-by-byte comparison, which slowed things down on larger files. After writing about the impact of buffering on file streams, I rewrote this article to use a buffer and .NET LINQ to improve performance.