在 BINARY 文件 (900mb - 4.5gb) 中搜索 byte[] 并获取偏移量的最快方法。 C＃

Question

基本上我想要一种更快且有效的方法来搜索二进制文件中的字节数组并获取偏移量。 byte[] 可以包含 5 - 50 个字节。它会读取 1 次搜索我有一个功能不能正常工作并且非常慢：

        static long ReadOneSrch(BinaryReader reader, byte[] bytes)
    {
        int b;
        long i = 0;
        while ((b = reader.BaseStream.ReadByte()) != -1)
        {
            if (b == bytes[i++])
            {
                if (i == bytes.Length)
                    return reader.BaseStream.Position - bytes.Length;
            }
            else
                i = b == bytes[0] ? 1 : 0;
        }

        return -1;
    }

Answer 1

这是我在流中使用 Boyer-Moore-Horspool 的实现。 BMH 的核心实现基本上是从 Boyer-Moore-Horspool Algorithm for All Matches (Find Byte array inside Byte array).

复制过来的

该方法重复将流读入缓冲区并对缓冲区应用 BMH 算法，直到我们获得匹配。为了同时找到跨越两个这样的读取的匹配项，我们总是将最后的 pattern.Length 字节从上一次读取传输到缓冲区的头部（可以通过评估已经排除的可能匹配开始来更聪明地完成一些努力 - 但是如果模式不是太长，你几乎不会注意到差异。

    /// <summary>
    /// Finds the first occurrence of <paramref name="pattern"/> in a stream
    /// </summary>
    /// <param name="s">The input stream</param>
    /// <param name="pattern">The pattern</param>
    /// <returns>The index of the first occurrence, or -1 if the pattern has not been found</returns>
    public static long IndexOf(Stream s, byte[] pattern)
    {
        // Prepare the bad character array is done once in a separate step
        var badCharacters = MakeBadCharArray(pattern);

        // We now repeatedly read the stream into a buffer and apply the Boyer-Moore-Horspool algorithm on the buffer until we get a match
        var buffer = new byte[Math.Max(2 * pattern.Length, 4096)];
        long offset = 0; // keep track of the offset in the input stream
        while (true)
        {
            int dataLength;
            if (offset == 0)
            {
                // the first time we fill the whole buffer
                dataLength = s.Read(buffer, 0, buffer.Length);
            }
            else
            {
                // Later, copy the last pattern.Length bytes from the previous buffer to the start and fill up from the stream
                // This is important so we can also find matches which are partly in the old buffer
                Array.Copy(buffer, buffer.Length - pattern.Length, buffer, 0, pattern.Length);
                dataLength = s.Read(buffer, pattern.Length, buffer.Length - pattern.Length) + pattern.Length;
            }

            var index = IndexOf(buffer, dataLength, pattern, badCharacters);
            if (index >= 0)
                return offset + index; // found!
            if (dataLength < buffer.Length)
                break;
            offset += dataLength - pattern.Length;
        }

        return -1;
    }

    // --- Boyer-Moore-Horspool algorithm ---
    // (Slightly modified code from
    // 
    // Prepare the bad character array is done once in a separate step:
    private static int[] MakeBadCharArray(byte[] pattern)
    {
        var badCharacters = new int[256];

        for (long i = 0; i < 256; ++i)
            badCharacters[i] = pattern.Length;

        for (var i = 0; i < pattern.Length - 1; ++i)
            badCharacters[pattern[i]] = pattern.Length - 1 - i;

        return badCharacters;
    }

    // Core of the BMH algorithm
    private static int IndexOf(byte[] value, int valueLength, byte[] pattern, int[] badCharacters)
    {
        int index = 0;

        while (index <= valueLength - pattern.Length)
        {
            for (var i = pattern.Length - 1; value[index + i] == pattern[i]; --i)
            {
                if (i == 0)
                    return index;
            }

            index += badCharacters[value[index + pattern.Length - 1]];
        }

        return -1;
    }

在 BINARY 文件 (900mb - 4.5gb) 中搜索 byte[] 并获取偏移量的最快方法。 C＃

The fastest method of searching a BINARY file(900mb - 4.5gb) for byte[] and grabbing the offset. C#

c#

binaryfiles

offset