存储在数据库中的长文本字符串的高效比较

Question

我将应用程序事件存储到数据库中，从不同的文本文件中提取。

活动对象如下：

public class LogEvent
{
    public DateTime DateTime { get; set; }
    public LogLevel Level { get; set; }
    public string Message { get; set; } //can be lengthy
}

请注意，我不拥有此结构，我无法将任何属性添加到原始生成的对象（但我可以扩展 class 并从中创建其他数据库列我掌握的信息）。

我的问题是我想确保我不会两次插入同一个事件，尽管它可以在不同的文件上复制。 DateTime+Level 属性不足以满足相等性：不同的事件可以同时发生。

因此，每当我将事件/事件列表插入数据库时，我需要与消息属性进行比较，由于潜在的字符串长度，这非常低效：它意味着我需要以一种或另一种方式传输已插入事件的 Message 属性，以便在本地将它们与数据库索引进行比较。

我想创建一个额外的属性 Hashcode 来存储 String.GetHashCode() of Message 属性。但是，我读到 here 这不是一个好的做法，因为 Hashcode 实现在程序执行中不稳定（和潜在的冲突，但这种风险可以接受）

所以，我遇到了以下问题：如何从长字符串构建比较值，它可以是确定性的，快速 compute/compare 反对，并且具有可接受的比率碰撞？。字符串最多可以包含几千个字符。

我知道 Jon Skeet 对类似问题的回答 here，但它已经很老了（将近 10 年），我想知道 2020 年是否有更好的方法！

感谢您的提示！

Answer 1

第 1 步. 按长度比较它们。它会切断其中的大部分。 Step 2. 比较第一个字符长度相同的字符串...等

Answer 2

扩展我的评论：使用 Murmur3 non-cryptographic 哈希算法。您可以从此处的 NuGet 获取它：https://www.nuget.org/packages/murmurhash/

不要使用 built-in GetHashCode() 因为，正如您推测的那样，在您的进程之外持久化是不安全的。
您可以（但不应该）使用 cryptographically-secure hash-functions 因为它们的计算成本很高 - 而且通常很慢（不一定 故意慢 ，但如果 SHA-256 计算起来微不足道，那么我会因为找到用于比特币挖掘的 SHA-256 哈希值而成为亿万富翁）。
- 而像 Murmur 这样的 hashing-functions 被设计为快速且公平 collision-resistant.

所以我会这样做：

编写一个函数，将您的 LogEntry 序列化为可重复使用的 MemoryStream，以便通过 MurmurHash 进行散列（NuGet 包 I linked-to 没有自动散列任何对象的能力 - 并且即使是这样，您也需要一个 rigidly-defined 散列操作——事实上，序列化 in-memory 是目前“最好的”方法）。如果您 re-use MemoryStream 这不会很贵。
将哈希存储在您的数据库中 and/or 缓存它 in-memory 以减少 IO 操作。

你的情况：

interface ILogEventHasher
{
    Int32 Compute32BitMurmurHash( LogEvent logEvent );
}

// Register this class as a singleton service in your DI container.
sealed class LogEventHasher : IDisposable
{
    private readonly MemoryStream ms = new MemoryStream();

    public Int32 Compute32BitMurmurHash( LogEvent logEvent )
    {
        if( logEvent is null ) throw new ArgumentNullException( nameof(logEvent) );

        this.ms.Position = 0;
        this.ms.Length   = 0; // This resets the length pointer, it doesn't deallocate memory.

        using( BinaryWriter wtr = new BinaryWriter( this.ms, Encoding.UTF8 ) )
        {
            wtr.Write( logEvent.DateTime );
            wtr.Write( logEvent.Level    );
            wtr.Write( logEvent.Message  );
        }

        this.ms.Position = 0; // This does NOT reset the Length pointer.

        using( Murmur32 mh = MurmurHash.Create32() )
        {
            Byte[] hash = mh.ComputeHash( this.ms );
            return BitConverter.ToInt32( hash ); // `hash` will be 4 bytes long.
        }

        // Reset stream state:
        this.ms.Position = 0;
        this.ms.Length = 0;

        // Shrink the MemoryStream if it's grown too large:
        const Int32 TWO_MEGABYTES = 2 * 1024 * 1024;
        if( this.ms.Capacity > TWO_MEGABYTES  )
        {
            this.ms.Capacity = TWO_MEGABYTES;
        }
    }

    public void Dispose()
    {
        this.ms.Dispose();
    }
}

要过滤 LogEvent 个实例 in-memory，只需使用 HashSet<( DateTime utc, Int32 hash )>.

我不建议使用 HashSet<Int32>（仅存储 Murmur hash-codes）因为使用 32 位 non-cryptographically-secure hash-code 不会提供我有足够的信心相信 hash-code 碰撞不会发生——但是将它与 DateTime 值结合起来让我有足够的信心（DateTime 值消耗 64 位，或 8 个字节——所以每个memoized LogEvent 将需要 12 个字节。给定 .NET 的 2GiB array/object 大小限制（并假设 HashSet load-factor 为 0.75）意味着您可以存储到 134,217,728 缓存 hash-codes in-memory。我希望这足够了！

这是一个例子：

interface ILogEventFilterService
{
    Boolean AlreadyLoggedEvent( LogEvent e );
}

// Register as a singleton service.
class HashSetLogEventFilter : ILogEventFilterService
{
    // Somewhat amusingly, internally this HashSet will use GetHashCode() - rather than our own hashes, because it's storing a kind of user-level "weak-reference" to a LogEvent in the form of a ValueTuple.

    private readonly HashSet<( DateTime utc, Int32 hash )> hashes = new HashSet<( DateTime utc, Int32 hash )>();

    private readonly ILogEventHasher hasher;

    public HashSetLogEventFilter( ILogEventHasher hasher )
    {
        this.hasher = hasher ?? throw new ArgumentNullException( nameof(hasher) );
    }

    public Boolean AlreadyLoggedEvent( LogEvent e )
    {
        if( e is null ) throw new ArgumentNullException( nameof(e) );

        if( e.DateTime.Kind != DateTimeKind.Utc )
        {
            throw new ArgumentException( message: "DateTime value must be in UTC.", paramName: nameof(e) );
        }

        Int32 murmurHash = this.hasher.HashLogEvent( e );
        
        var t = ( utc: e.DateTime, hash: murmurHash );
        
        return this.hashes.Add( t ) == false;
    }
}

如果您想直接在数据库中执行此操作，则为运行 MERGE 语句的 table-valued-parameter 定义自定义 user-defined-table-type形式：

CREATE TABLE dbo.LogEvents (
    Utc        datetime2(7)   NOT NULL,
    MurmurHash int            NOT NULL,
    LogLevel   int            NOT NULL,
    Message    nvarchar(4000) NOT NULL
);

MERGE INTO dbo.LogEvents AS tgt WITH ( HOLDLOCK ) -- Always MERGE with HOLDLOCK!!!!!
USING @tvp AS src ON src.DateTime = tgt.DateTime AND src.MurmurHash = tgt.MurmurHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT(     Utc,     MurmurHash,     LogLevel,     Message )
    VALUES( src.Utc, src.MurmurHash, src.LogLevel, src.Message )
;

存储在数据库中的长文本字符串的高效比较

Efficient comparison of long text strings stored into a Database

c#

hashcode