有没有更快的方法来搜索大文件而不将其加载到内存中?

Is there any faster way to search in huge file without loading it into memory?

我有一个巨大的文件无法加载到内存中,并且我需要在其中找到一个字节序列。

这是我现在用的:

public static byte[] GetRangeFromStream(ref FileStream fs, long start_index, long count)
{
    byte[] data = new byte[count];
    long prev_pos = fs.Position;
    fs.Position = start_index;
    fs.Read(data, 0, data.Length);
    fs.Position = prev_pos;
    return data;
}

public static long GetSequenceIndexInFileStream(byte[] seq, ref FileStream fs, long index, bool get_beginning = true)
{
    if (index >= fs.Length)
        return -1;

    fs.Position = index;

    for (long i = index; i < fs.Length; i++)
    {
        byte temp_byte = (byte)fs.ReadByte();

        if (temp_byte == seq[0] && IsArraysSame(seq, GetRangeFromStream(ref fs, i, seq.Length))) //compare just first bytes and then compare seqs if needed
            return i;
    }

    return -1;
}

这样做的最佳方法是逐字节读取文件流,只查找您要搜索的字符串的第一个字节。当你得到匹配时,你知道你有可能命中。读取接下来的 N 个字节(其中 N 是要搜索的字符串的长度)并对从文件中读取的字节内容和要搜索的字节进行数组比较。

让 .net 在您打开它时通过设置适当的 FileStream 缓冲区来处理文件读取缓冲区。不要担心阅读转发来创建自己的缓冲区 - 这是浪费时间 - .net 做的很好。

这种方法意味着您不是比较每个字节的数组,并且您不关心缓冲区之间的拆分等。

如果文件非常大而您未 I/O 绑定,那么您可以考虑使用 .net 任务库创建多个任务,每个任务都从文件流中的不同位置开始,并关联结果完成所有任务后的每个任务。

我建议使用 Knuth Moriss Pratt algorithm 的修改版本。 算法kmp_search: 输入: 字符流,S(要搜索的文本) 一组字符,W(搜索的单词) 输出: 一个整数(在 S 中找到 W 的从零开始的位置)

define variables:
    an integer, m ← 0 (the beginning of the current match in S)
    an integer, i ← 0 (the position of the current character in W)
    an array of integers, T (the table, computed elsewhere)

while m + i < length(S) do
    if W[i] = S[m + i] then
        if i = length(W) - 1 then
            return m
        let i ← i + 1
    else
        if T[i] > -1 then
            let m ← m + i - T[i], i ← T[i]
        else
            let m ← m + 1, i ← 0

(if we reach here, we have searched all of S unsuccessfully)
return the length of S

The text string can be streamed in because the KMP algorithm does not backtrack in the text. (This is another improvement over the naive algorithm, which doesn’t naturally support streaming.) If streaming, the amortized time to process an incoming character is Ɵ(1) but the worst-case time is Ɵ(min(m, n′)), where n′ is the number of text characters seen so far. Source

可以找到参考 (Java) 实现 here

package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

  protected byte[] pattern_;
  protected int[] borders_;

  // An upper bound on pattern length for searching. Results are undefined for longer patterns.
  public static final int MAX_PATTERN_LENGTH = 1024;

  public StreamSearcher(byte[] pattern) {
    setPattern(pattern);
  }

  /**
   * Sets a new pattern for this StreamSearcher to use.
   * @param pattern
   *          the pattern the StreamSearcher will look for in future calls to search(...)
   */
  public void setPattern(byte[] pattern) {
    pattern_ = Arrays.copyOf(pattern, pattern.length);
    borders_ = new int[pattern_.length + 1];
    preProcess();
  }

  /**
   * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
   * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
   * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
   * another reasonable default, i.e. leave the stream unchanged.
   *
   * @return bytes consumed if found, -1 otherwise.
   * @throws IOException
   */
  public long search(InputStream stream) throws IOException {
    long bytesRead = 0;

    int b;
    int j = 0;

    while ((b = stream.read()) != -1) {
      bytesRead++;

      while (j >= 0 && (byte)b != pattern_[j]) {
        j = borders_[j];
      }
      // Move to the next character in the pattern.
      ++j;

      // If we've matched up to the full pattern length, we found it.  Return,
      // which will automatically save our position in the InputStream at the point immediately
      // following the pattern match.
      if (j == pattern_.length) {
        return bytesRead;
      }
    }

    // No dice, Note that the stream is now completely consumed.
    return -1;
  }

  /**
   * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
   * and aids in implementation of the Knuth-Moore-Pratt string search.
   * <p>
   * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
   */
  protected void preProcess() {
    int i = 0;
    int j = -1;
    borders_[i] = j;
    while (i < pattern_.length) {
      while (j >= 0 && pattern_[i] != pattern_[j]) {
        j = borders_[j];
      }
      borders_[++i] = ++j;
    }
  }
}

类似问题:Efficient way to search a stream for a string

您可能想查看 file mapping。它允许您基本上将大文件视为内存缓冲区,从而允许在 磁盘文件 上使用 任何基于内存的 API ].不要乱搞显式文件 I/O.

MSDN:

File mapping is the association of a file's contents with a portion of the virtual address space of a process. The system creates a file mapping object (also known as a section object) to maintain this association. A file view is the portion of virtual address space that a process uses to access the file's contents. File mapping allows the process to use both random input and output (I/O) and sequential I/O. It also allows the process to work efficiently with a large data file, such as a database, without having to map the whole file into memory. Multiple processes can also use memory-mapped files to share data....tell me more...