Sometimes you need to test if two files are the same. As files are getting larger, your scripts will take longer, so we need to look into performance. In this article, I'll show how to compare two files using a buffered approach in PowerShell.
When talking about performance it is better to measure multiple times on multiple systems. In this blog I only measured once on a single system, because I'm only interested in relative changes.
Strategies
When you want to compare files, you have some strategies:
- Generate a checksum of both files and compare them. Generating a checksum means you have to read both files from beginning to end. It would be great if we could stop reading as soon as one bit differs, so checksums are slow as an equality check.
- Do a byte-by-byte comparison of both files. You can stop the comparison when a byte differs, so you don't have to read the entire file. However, reading one byte at a time is slow; reading many bytes at once is usually faster.
- Do a buffer comparison, reading many bytes at once and comparing the resulting arrays.
- Start by comparing the file sizes. This makes perfect sense, as files with different sizes cannot be the same by definition.
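The checksum strategy, with the size check as a cheap first step, can be sketched as follows. This is a minimal sketch, assuming PowerShell 4.0 or later for the built-in Get-FileHash cmdlet; the function name Test-FileHashEqual and its parameters are made up for illustration. Note that both files are always hashed in full, which is exactly the cost described above.

```powershell
# Hypothetical helper (not part of the article's code): compares two files by checksum.
function Test-FileHashEqual {
    param(
        [Parameter(Mandatory = $true)] [string] $Path1,
        [Parameter(Mandatory = $true)] [string] $Path2
    )

    # Cheap first check: files of different size can never be equal.
    $first  = Get-Item -LiteralPath $Path1
    $second = Get-Item -LiteralPath $Path2
    if ($first.Length -ne $second.Length) { return $false }

    # Both files are read from beginning to end to compute the hashes.
    $hash1 = (Get-FileHash -LiteralPath $Path1 -Algorithm SHA256).Hash
    $hash2 = (Get-FileHash -LiteralPath $Path2 -Algorithm SHA256).Hash
    return $hash1 -eq $hash2
}
```

Calling it would look like Test-FileHashEqual .\article.html .\article2.html, which returns $true or $false.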
Extra performance: use LINQ
PowerShell is a scripting language that has native support for comparison, but it is way faster to use LINQ for the buffer comparison than to do it in native PowerShell! We'll use SequenceEqual, like this: [System.Linq.Enumerable]::SequenceEqual($one, $two). This LINQ method was 500x faster on my machine.
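To see the difference for yourself, here is a small sketch that compares two byte arrays with a plain PowerShell loop and with SequenceEqual. The 1 MB array size and the variable names are arbitrary choices for illustration; timings differ per machine, so no numbers are shown here.

```powershell
# Two identical 1 MB buffers (size chosen arbitrarily for illustration).
$one = New-Object byte[] 1048576
$two = New-Object byte[] 1048576

# Native PowerShell: compare the arrays element by element.
$native = Measure-Command {
    $equal = $true
    for ($i = 0; $i -lt $one.Length; $i++) {
        if ($one[$i] -ne $two[$i]) { $equal = $false; break }
    }
}

# LINQ: both arguments must be typed arrays (byte[]),
# so PowerShell can resolve the generic SequenceEqual method.
$linq = Measure-Command {
    $equal = [System.Linq.Enumerable]::SequenceEqual($one, $two)
}

"Native loop: {0:N1} ms" -f $native.TotalMilliseconds
"LINQ       : {0:N1} ms" -f $linq.TotalMilliseconds
```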
FilesAreEqual function
Let's create a function that takes two files, compares them, and returns $true when both files are equal:
function FilesAreEqual {
    param(
        [System.IO.FileInfo] $first,
        [System.IO.FileInfo] $second,
        [uint32] $bufferSize = 524288)

    # Files with a different size can never be equal.
    if ( $first.Length -ne $second.Length ) { return $false }
    if ( $bufferSize -eq 0 ) { $bufferSize = 524288 }

    $fs1 = $first.OpenRead()
    $fs2 = $second.OpenRead()

    $one = New-Object byte[] $bufferSize
    $two = New-Object byte[] $bufferSize
    $equal = $true

    try {
        do {
            $bytesRead = $fs1.Read($one, 0, $bufferSize)
            $fs2.Read($two, 0, $bufferSize) | out-null

            # The whole buffer is compared. After a partial last read, the tail
            # of both buffers still holds the previous (equal) chunk or zeros.
            if ( -Not [System.Linq.Enumerable]::SequenceEqual($one, $two)) {
                $equal = $false
            }
        } while ($equal -and $bytesRead -eq $bufferSize)
    }
    finally {
        # Close the streams, even when reading throws.
        $fs1.Close()
        $fs2.Close()
    }

    return $equal
}
Buffering works
So what happens when we plug different values into $bufferSize? This is what I got on my machine by comparing the 140MB Wars of Liberty bin file with itself:
| Buffer in bytes | Execution time in seconds | Ratio |
|---|---|---|
| 2,097,152 | 2.4809 | 1.0210 |
| 1,048,576 | 2.4400 | 1.0042 |
| 524,288 | 2.4299 | 1.0000 |
| 262,144 | 2.4357 | 1.0024 |
| 131,072 | 2.4960 | 1.0272 |
| 65,536 | 2.5302 | 1.0413 |
| 32,768 | 2.6458 | 1.0888 |
| 16,384 | 2.9035 | 1.1949 |
| 8,192 | 4.3234 | 1.7793 |
| 4,096 | 5.4670 | 2.2499 |
| 2,048 | 7.6858 | 3.1630 |
| 1,024 | 12.0329 | 4.9520 |
A buffer that is large enough makes the file comparison way faster, which is why I settled on 524288 bytes as the default for this PowerShell function. Comparing the entire 2.2GB setup file of Wars of Liberty took me 42 seconds.
So what about a byte-by-byte comparison? That one took me 32 seconds on the 140MB file, which is more than 13x slower than using a buffer of 524288 bytes.
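If you want to repeat a measurement like this on your own system, a loop along these lines might do. It is only a sketch: it times raw reads of a temporary 16MB test file rather than a full comparison, and the file size and the three buffer sizes are arbitrary choices of mine, not the setup used for the table above. A single run like this is also affected by disk caching.

```powershell
# Create a temporary 16MB test file filled with zeros (arbitrary size for illustration).
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($path, (New-Object byte[] (16MB)))

try {
    foreach ($bufferSize in 1024, 16384, 524288) {
        $buffer = New-Object byte[] $bufferSize

        $elapsed = Measure-Command {
            $fs = [System.IO.File]::OpenRead($path)
            try {
                # Keep reading until Read returns 0, which signals the end of the stream.
                while ($fs.Read($buffer, 0, $buffer.Length) -gt 0) { }
            }
            finally {
                $fs.Close()
            }
        }

        "{0,9} bytes: {1:N4} s" -f $bufferSize, $elapsed.TotalSeconds
    }
}
finally {
    # Clean up the temporary file.
    Remove-Item -LiteralPath $path
}
```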
Just need a script?
If you just need a script, copy this text to compare.ps1 and run it on the command line like .\compare.ps1 .\article.html .\article2.html. The result will be printed to the console.
<#
.SYNOPSIS
    Compares two files. Returns True if the files are equal.
.DESCRIPTION
    Compares two files. Returns True if the files are equal; otherwise False.
    Use the bufferSize to optimize for speed. Might depend on your system.
.PARAMETER file1
    The first file.
.PARAMETER file2
    The second file.
.PARAMETER bufferSize
    The size of the buffer will influence the speed of the script.
.OUTPUTS
    True when equal; otherwise False.
.LINK
    More info: https://keestalkstech.com/2013/01/comparing-files-with-powershell/
#>
param(
    [Parameter(Mandatory = $true)]
    [string]
    $file1,
    [Parameter(Mandatory = $true)]
    [string]
    $file2,
    [uint32]
    $bufferSize = 524288)

$ErrorActionPreference = "Stop"
$PSDefaultParameterValues['*:ErrorAction']='Stop'

$first = Get-Item $file1
$second = Get-Item $file2

# Files with a different size can never be equal.
if ( $first.Length -ne $second.Length ) { return $false }
if ( $bufferSize -eq 0 ) { $bufferSize = 524288 }

$fs1 = $first.OpenRead()
$fs2 = $second.OpenRead()

$one = New-Object byte[] $bufferSize
$two = New-Object byte[] $bufferSize
$equal = $true

try {
    do {
        $bytesRead = $fs1.Read($one, 0, $bufferSize)
        $fs2.Read($two, 0, $bufferSize) | out-null

        if ( -Not [System.Linq.Enumerable]::SequenceEqual($one, $two)) {
            $equal = $false
        }
    } while ($equal -and $bytesRead -eq $bufferSize)
}
finally {
    # Close the streams, even when reading throws.
    $fs1.Close()
    $fs2.Close()
}

return $equal
Further reading
While working on the subject I found some interesting reads:
- High Performance PowerShell with LINQ - shows how to use LINQ instead of native PowerShell.
- Comparing two byte arrays in .NET - discussion on doing fast byte array comparison in .NET. Most upvotes go to Enumerable.SequenceEqual, but there are faster ways.
Improvements
2020-10-10: added the Just need a script? section.
2020-10-10: on some systems the first 2 if statements did not work, according to caspertone2003; fixed it with his code.
2020-06-08: the original article used a byte-by-byte comparison, slowing things down on larger files. After writing about the impact of buffering on file streams, I rewrote this article to include buffering and .NET LINQ to improve the performance.
Great feedback. Happy to help!
The problem is that the file compare reads only 8 bytes per iteration.
Just set $BYTES_TO_READ = 32768 and all the slowness goes away :-) @Kees, thanks for sharing your code!
Changed the code. Thanks!
Sorry to bother you again. I realized that changing $BYTES_TO_READ is not enough, because inside the loop the BitConverter calls only compare the first 8 bytes (= one Int64) of the buffer. After some deliberation I settled for a second, inner loop that iterates over the byte arrays and compares every byte individually. This is reasonably fast, and much faster than the ultra-slow Compare-Object cmdlet.
$byteArrayLength = $one.Length
for ($j = 0; $j -lt $byteArrayLength; $j = $j + 1)
{
    if ($one[$j] -ne $two[$j])
    {
        $fs1.Close();
        $fs2.Close();
        return $false;
    }
}
I have a SQLite database of 2.5GB.
I made a copy of it and changed the value of a field in a table (added a character, the size of the DB is not increased).
The function tells me that the 2 files are the same … something is not working.
I entered the for loop modified by Patrick
Edit:
Using your original function the comparison works, but it is very slow.
Do you know how to correctly integrate the for by Patrick so that it stays fast?
Edit2:
Ok I have correctly integrated Patrick’s nested loop, but it is still too slow to compare such large files. Thanks anyway