Seek Position of a String in a Stream

Yesterday I was working on a bit of code that had to read the XMP metadata from a file. The XMP packet is not stored at a fixed position, so I had to scan the file. XMP, being plain XML, can be found by simple string matching. After some searching I found many solutions that read the entire file into memory and perform a regular expression search or a string comparison. That's not going to work for me, because I have files that are over 100 MB! So I wrote some code that searches for a string in a stream.

Hello from 2020! The original article was written in 2010 for .NET Framework. Now, 10 years later, we have .NET Core 3.1 / .NET Framework 4.7 and a huge performance improvement with Span<T>, so a rewrite was in order.

And hello from 2022: the example now includes .NET 6 code with an HttpClient instead of the deprecated WebClient.

Encoding, encoding, encoding!

When you search for a string in a byte array, you must know the encoding that was used to write that string. Unicode UTF-32 encoding uses 4 bytes (32 bits) per character, while ASCII uses a single byte per character. So, let's be explicit about encoding:

using System;
using System.IO;
using System.Text;

public static long Seek(Stream stream, string str, Encoding encoding)
{
    var search = encoding.GetBytes(str);
    return Seek(stream, search);
}
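To make the encoding point concrete, here is a minimal sketch that encodes the same text with a few of the built-in encodings and then searches a UTF-16 buffer for UTF-8 bytes. The sample strings are made up for illustration.

```csharp
using System;
using System.Text;

var text = "<x:xmpmeta";

// The same 10 characters become byte patterns of different lengths:
// ASCII and UTF-8: 10 bytes, UTF-16: 20 bytes, UTF-32: 40 bytes.
foreach (var encoding in new[] { Encoding.ASCII, Encoding.UTF8, Encoding.Unicode, Encoding.UTF32 })
{
    var bytes = encoding.GetBytes(text);
    Console.WriteLine($"{encoding.WebName}: {bytes.Length} bytes");
}

// Searching with the wrong encoding fails: the UTF-16 data has a zero
// byte after every ASCII-range character, so the UTF-8 pattern never matches.
var haystack = Encoding.Unicode.GetBytes("some data <x:xmpmeta more data");
var needle = Encoding.UTF8.GetBytes(text);
Console.WriteLine(haystack.AsSpan().IndexOf(needle)); // -1
```

So the encoding passed to Seek has to match the encoding that was used to write the data, not the encoding you happen to prefer.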

Performance: use a buffer

We could read the stream byte-for-byte, but it is usually faster to read a number of bytes at the same time. We will use a buffer of 1024 bytes, or twice the length of the search bytes when that is larger.

But... we need to be careful when we read "the next buffer". The previous buffer might have ended with a partial match. To account for this, we copy the last n bytes to the beginning of the buffer, where n is the length of the search. In the loop we fill the rest of the buffer with new bytes from the stream.
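The boundary problem is easy to reproduce on a tiny scale. The following sketch uses made-up data and a 4-byte "buffer" to show a match that straddles two chunks, and how carrying over the tail of the first chunk recovers it. It only illustrates the idea, it is not the Seek method itself.

```csharp
using System;
using System.Text;

var data = Encoding.ASCII.GetBytes("abcdefgh");
var search = Encoding.ASCII.GetBytes("de");

// Split the data into two 4-byte chunks: "abcd" and "efgh".
ReadOnlySpan<byte> first = data.AsSpan(0, 4);
ReadOnlySpan<byte> second = data.AsSpan(4, 4);

// "de" straddles the boundary, so neither chunk contains it.
Console.WriteLine(first.IndexOf(search));  // -1
Console.WriteLine(second.IndexOf(search)); // -1

// Carry over the last search.Length bytes of the first chunk and
// put the second chunk behind them: the window becomes "cdefgh".
var window = new byte[search.Length + second.Length];
first.Slice(first.Length - search.Length).CopyTo(window);
second.CopyTo(window.AsSpan(search.Length));

// The window starts at absolute offset 4 - 2 = 2 in the data.
var windowStart = first.Length - search.Length;
var i = window.AsSpan().IndexOf(search);
Console.WriteLine(windowStart + i); // 3, the index of "de" in "abcdefgh"
```

Note that the position bookkeeping has to account for the carried-over bytes: the window starts search.Length bytes before the chunk that was just read.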

Seek string in stream

Let's look at the code:

public static long Seek(Stream stream, byte[] search)
{
    int bufferSize = 1024;
    if (bufferSize < search.Length * 2) bufferSize = search.Length * 2;

    var buffer = new byte[bufferSize];
    var size = bufferSize;
    var offset = 0;
    var position = stream.Position;

    while (true)
    {
        var r = stream.Read(buffer, offset, size);

        // when no bytes are read -- the string could not be found
        if (r <= 0) return -1;

        // when fewer than size bytes are read, slice the buffer so
        // stale bytes from the previous read are not searched
        ReadOnlySpan<byte> ro = buffer;
        if (r < size)
        {
            ro = ro.Slice(0, offset + r);
        }

        // check if we can find our search bytes in the buffer
        var i = ro.IndexOf(search);
        if (i > -1) return position + i;

        // a short read is treated as the end of the stream here,
        // so we are done and found nothing
        if (r < size) return -1;

        // we still have bytes to read, so copy the last search
        // length to the beginning of the buffer. It might contain
        // a part of the bytes we need to search for

        offset = search.Length;
        size = bufferSize - offset;
        Array.Copy(buffer, buffer.Length - offset, buffer, 0, offset);
        position += bufferSize - offset;
    }
}

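ReadOnlySpan<T>.IndexOf does the actual matching. It searches for a whole sequence of bytes in one call and is well optimized in the runtime, so there is no need to hand-roll a comparison loop.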
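One caveat: the Seek method treats a read that returns fewer bytes than requested as the end of the stream. That holds for a MemoryStream, but Stream.Read is allowed to return fewer bytes than requested even when more data follows, which is common with network streams. A possible way around this, sketched here with a hypothetical ReadFully helper that is not part of the original code, is to keep reading until the requested number of bytes has arrived or Read returns 0.

```csharp
using System;
using System.IO;

// Usage: try to fill an 8-byte block from a 5-byte stream.
using var source = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var block = new byte[8];
var read = ReadFully(source, block, 0, block.Length);
Console.WriteLine(read); // 5: the stream ended before the block was full

// Hypothetical helper: calls Read repeatedly until 'count' bytes have
// been read or the stream signals its end by returning 0.
static int ReadFully(Stream stream, byte[] buffer, int offset, int count)
{
    var total = 0;
    while (total < count)
    {
        var r = stream.Read(buffer, offset + total, count - total);
        if (r <= 0) break; // end of stream
        total += r;
    }
    return total;
}
```

With a helper like this in place of the single stream.Read call, a result smaller than the requested size really does mean the stream has ended.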

XMP POC

We started out with the problem of reading XMP from huge files. I cannot share those files, but I have a smaller example to show the proof of concept. This code will extract the XMP information:

var url = "https://keestalkstech.com/wp-content/uploads/2020/06/photo-with-xmp.jpg?1";

using var client = new HttpClient();
using var downloadStream = await client.GetStreamAsync(url);

using var stream = new MemoryStream();
await downloadStream.CopyToAsync(stream);

stream.Position = 0;
var enc = Encoding.UTF8;
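var start = Seek(stream, "<x:xmpmeta", enc);

// Seek reads ahead in blocks, so rewind to the start of the
// match before searching for the closing packet marker
stream.Position = start;
var end = Seek(stream, "<?xpacket", enc);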

stream.Position = start;
var buffer = new byte[end - start];
stream.Read(buffer, 0, buffer.Length);
var xmp = enc.GetString(buffer);

It will show:

<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='Image::ExifTool 10.40'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

 <rdf:Description rdf:about=''
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:creator>
   <rdf:Seq>
    <rdf:li>Chris Reid</rdf:li>
   </rdf:Seq>
  </dc:creator>
  <dc:rights>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>Unsplash, free to use</rdf:li>
   </rdf:Alt>
  </dc:rights>
  <dc:title>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>Python Code</rdf:li>
   </rdf:Alt>
  </dc:title>
 </rdf:Description>
</rdf:RDF>
</x:xmpmeta>

Final thoughts

Span<T> makes this kind of byte-level code easier to write, and faster. Streams remain hard to work with, because not all streams support changing the Position property, as I discovered when I tried to work with an HTTP stream. And when searching for a string in a stream we are really searching for bytes, so we need to know the encoding!
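The Position problem can be checked up front with Stream.CanSeek. The sketch below uses a hypothetical EnsureSeekableAsync helper (not part of the original code) that returns the stream as-is when it is seekable and otherwise copies it into a MemoryStream, which is what the XMP example does by hand.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Usage: a MemoryStream is already seekable, so it is returned as-is.
var input = new MemoryStream(new byte[] { 1, 2, 3 });
using var seekable = await EnsureSeekableAsync(input);
Console.WriteLine(seekable.CanSeek); // True

// Hypothetical helper: buffers a non-seekable stream in memory so that
// Position can be read and set afterwards.
static async Task<Stream> EnsureSeekableAsync(Stream source)
{
    if (source.CanSeek) return source;

    var copy = new MemoryStream();
    await source.CopyToAsync(copy);
    copy.Position = 0;
    return copy;
}
```

Buffering in memory is fine for a small download, but for the very large files that motivated this article a temporary file would probably be a better target than a MemoryStream.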

Changelog

2022-09-18 Swapped out the WebClient for an HttpClient to be more compatible. Code is now written in .NET 6.
2020-06-07 Changed the article to reflect the latest insights and .NET Core 3.1.
2010-11-20 Initial article.