Compare Files with PowerShell: a faster way

Black thumb drive or "USB stick" as we call them here in The Netherlands.

Sometimes you need to test if two files are the same. As files get larger, your scripts take longer, so performance matters. In this article, I'll show how to compare two files in PowerShell using a buffered approach.

When talking about performance, it is better to measure multiple times on multiple systems. In this blog I measured only once, on a single system, because I'm only interested in relative changes.
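
If you want to take measurements yourself, PowerShell's built-in Measure-Command cmdlet is an easy way to do it. A minimal sketch; the script block and file name are placeholders:

# time any script block; .\big.bin stands in for a large test file
$elapsed = Measure-Command { Get-FileHash .\big.bin }
$elapsed.TotalSeconds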

Strategies

When you want to compare files, you have some strategies:

  • Generate a checksum of both files and compare them. Generating a checksum means parsing both files from beginning to end, even when the very first bit already differs. That makes checksums slow as an equality check (see the one-liner after this list).
  • Do a byte-by-byte comparison of both files. You can stop the comparison as soon as a byte differs, so you don't have to read the entire file, but reading one byte at a time is slow.
  • Do a buffer comparison: read several bytes at once and compare the arrays. Reading many bytes at once is usually faster.
  • Start with a comparison of the file sizes. This one makes perfect sense: files with different sizes cannot be equal by definition.
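
The checksum strategy from the first bullet is a one-liner with the built-in Get-FileHash cmdlet. Note that it always reads both files in full; the file names are placeholders:

# checksum equality check: always reads both files from beginning to end
(Get-FileHash .\one.bin).Hash -eq (Get-FileHash .\two.bin).Hash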

Extra performance: use LINQ

PowerShell is a scripting language with native support for comparison. But it is way faster to use LINQ for the buffer comparison than to do it in native PowerShell! We'll use SequenceEqual, like this: [System.Linq.Enumerable]::SequenceEqual($one, $two). The LINQ method was about 500x faster on my machine.
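
To see the gap yourself, you can time both approaches on two identical byte arrays. A minimal sketch; I use 65.536 elements here to keep the slow native run bearable:

$one = New-Object byte[] 65536
$two = New-Object byte[] 65536

# native PowerShell: Compare-Object walks the arrays as collections (slow)
(Measure-Command { Compare-Object $one $two }).TotalMilliseconds

# LINQ: a single optimized call on the same arrays
(Measure-Command { [System.Linq.Enumerable]::SequenceEqual($one, $two) }).TotalMilliseconds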

FilesAreEqual function

Let's create a function that takes two files, compares them and returns $true when both files are equal:

function FilesAreEqual {
    param(
        [System.IO.FileInfo] $first,
        [System.IO.FileInfo] $second,
        [uint32] $bufferSize = 524288)

    # files of different sizes cannot be equal, so bail out early
    if ( $first.Length -ne $second.Length ) { return $false }
    if ( $bufferSize -eq 0 ) { $bufferSize = 524288 }

    $fs1 = $first.OpenRead()
    $fs2 = $second.OpenRead()

    try {
        $one = New-Object byte[] $bufferSize
        $two = New-Object byte[] $bufferSize
        $equal = $true

        do {
            # read a chunk from both files into the buffers
            $bytesRead = $fs1.Read($one, 0, $bufferSize)
            $fs2.Read($two, 0, $bufferSize) | Out-Null

            # LINQ compares the buffers much faster than native PowerShell;
            # the tail of a partial last chunk still holds bytes from the
            # previous (equal) read, so comparing full buffers stays correct
            if ( -not [System.Linq.Enumerable]::SequenceEqual($one, $two)) {
                $equal = $false
            }

            # stop on the first difference or after the last (partial) chunk
        } while ($equal -and $bytesRead -eq $bufferSize)
    }
    finally {
        # close the streams, even when the comparison throws
        $fs1.Close()
        $fs2.Close()
    }

    return $equal
}
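
A quick usage sketch; the file names are placeholders. Get-Item resolves the paths against the current PowerShell location, which is safer than letting PowerShell convert a relative string to a FileInfo:

# placeholders: any two files you want to compare
$a = Get-Item .\one.bin
$b = Get-Item .\two.bin
FilesAreEqual -first $a -second $b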

Buffering works

So what happens when we plug different values into $bufferSize? This is what I got on my machine by comparing the 140MB Wars of Liberty bin with itself:

Buffer in bytes | Execution time in seconds | Ratio
2.097.152       | 2,4809                    | 1,0210
1.048.576       | 2,4400                    | 1,0042
524.288         | 2,4299                    | 1,0000
262.144         | 2,4357                    | 1,0024
131.072         | 2,4960                    | 1,0272
65.536          | 2,5302                    | 1,0413
32.768          | 2,6458                    | 1,0888
16.384          | 2,9035                    | 1,1949
8.192           | 4,3234                    | 1,7793
4.096           | 5,4670                    | 2,2499
2.048           | 7,6858                    | 3,1630
1.024           | 12,0329                   | 4,9520

A buffer that is large enough makes the file compare way faster; that's why I settled on 524.288 bytes for this PowerShell function. Comparing the entire 2.2GB setup file of Wars of Liberty took 42 seconds.

So what about a byte-by-byte comparison? That one took 32 seconds on the 140MB file, which is more than 13x slower than using a buffer of 524.288 bytes.
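
If you want to reproduce the table on your own machine, a loop over the buffer sizes will do. A sketch that assumes the FilesAreEqual function above; the file name is a placeholder:

$file = Get-Item .\WarsOfLiberty.bin  # placeholder: any large test file

foreach ($size in 2097152, 1048576, 524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096, 2048, 1024) {
    # compare the file with itself, so every byte has to be read
    $seconds = (Measure-Command { FilesAreEqual $file $file -bufferSize $size }).TotalSeconds
    "{0,9:N0} bytes: {1:N4} seconds" -f $size, $seconds
}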

Just need a script?

If you just need a script, copy this text to compare.ps1 and run it from the command line like .\compare.ps1 .\article.html .\article2.html. The result will be printed to the console.

<#
.SYNOPSIS
    Compares two files. Returns True if the files are equal.
.DESCRIPTION
    Compares two files. Returns True if the files are equal; otherwise False.
    Use the bufferSize to optimize for speed. The optimal value might depend on your system.
.PARAMETER file1
    The first file.
.PARAMETER file2
    The second file.
.PARAMETER bufferSize
    The size of the buffer will influence the speed of the script.
.OUTPUTS
    True when equal otherwise False.
.LINK
    More info: https://keestalkstech.com/2013/01/comparing-files-with-powershell/
#>
param(
    [Parameter(Mandatory = $true)]
    [string]
    $file1,
    [Parameter(Mandatory = $true)]
    [string]
    $file2, 
    [uint32]
    $bufferSize = 524288)

$ErrorActionPreference = "Stop"
$PSDefaultParameterValues['*:ErrorAction']='Stop'

$first = Get-Item $file1
$second = Get-Item $file2

# files of different sizes cannot be equal, so bail out early
if ( $first.Length -ne $second.Length ) { return $false }
if ( $bufferSize -eq 0 ) { $bufferSize = 524288 }

$fs1 = $first.OpenRead()
$fs2 = $second.OpenRead()

try {
    $one = New-Object byte[] $bufferSize
    $two = New-Object byte[] $bufferSize
    $equal = $true

    do {
        # read a chunk from both files into the buffers
        $bytesRead = $fs1.Read($one, 0, $bufferSize)
        $fs2.Read($two, 0, $bufferSize) | Out-Null

        # LINQ compares the buffers much faster than native PowerShell
        if ( -not [System.Linq.Enumerable]::SequenceEqual($one, $two)) {
            $equal = $false
        }

        # stop on the first difference or after the last (partial) chunk
    } while ($equal -and $bytesRead -eq $bufferSize)
}
finally {
    # close the streams, even when the comparison throws
    $fs1.Close()
    $fs2.Close()
}

return $equal


Improvements

2020-10-10: added the Just need a script? section.
2020-10-10: on some systems the first two if statements did not work, according to caspertone2003; fixed it with his code.
2020-06-08: the original article used a byte-by-byte comparison, which slowed things down on larger files. After writing about the impact of buffering on file streams, I rewrote this article to use a buffer and .NET LINQ to improve the performance.

  1. Kees C. Bakker says:

    Great feedback. Happy to help!

  2. Patrick Näf says:

    The problem is that the file compare reads only 8 bytes per iteration.
    Just set $BYTES_TO_READ = 32768 and all the slowness goes away :-) @Kees, thanks for sharing your code!

    1. Kees C. Bakker says:

      Changed the code. Thanks!

      1. Patrick Näf says:

        Sorry to bother you again. I realized that changing $BYTES_TO_READ is not enough, because inside the loop the BitConverter calls only compare the first 8 Bytes (= one Int64) of the buffer. After some deliberation I settled for a second, inner loop that iterates over the byte arrays and individually compares every byte. This is reasonably fast, and it’s especially much faster than the ultra-slow compare-object cmdlet.

        $byteArrayLength = $one.Length
        for ($j = 0; $j -lt $byteArrayLength; $j = $j + 1)
        {
            if ($one[$j] -ne $two[$j])
            {
                $fs1.Close();
                $fs2.Close();

                return $false;
            }
        }

  3. Mattia Lancieri says:

    I have a SQLite database of 2.5GB.
    I made a copy of it and changed the value of a field in a table (added a character; the size of the DB did not increase).
    The function tells me that the 2 files are the same … something is not working.

    I inserted the for loop modified by Patrick.

    Edit:
    Using your original function the comparison works, but it is very slow.

    Do you know how to correctly integrate Patrick’s for loop so that it stays fast?

    Edit2:
    Ok, I have correctly integrated Patrick’s nested loop, but it is still too slow to compare such large files. Thanks anyway.
