快速随机访问集合

Question

我正在使用半随机标记流。对于每个令牌，我都在维护大量数据（包括一些子集合）。

唯一令牌的数量没有限制，但实际上往往在 100,000-300,000 的数量级。

我从一个列表开始，并使用 Linq 查询确定要更新的适当标记对象。

public class Model {
    public List<State> States { get; set; }
    ...
}

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

在前 ~30k 个唯一令牌中，我能够找到并更新~1,100 tokens/sec。

性能分析表明，总 Cpu 周期中的 85% 花在了 Where(...).SingleOrDefault() 上（这是有道理的，列表是一种低效的搜索方式）。

因此，我将列表切换到 HashSet 并再次进行分析，确信 HashSet 能够更快地随机查找。这次，我只处理了 ~900 tokens/sec。在 Linq 上花费的时间几乎相同 (89%)。

所以...首先，我是否滥用了 HashSet？（使用 Linq 是否强制转换为 IEnumerable，然后是枚举/类似的东西？）

如果不是，我自己实现的最佳模式是什么？我的印象是 HashSet 已经进行了二进制搜索，所以我假设我需要构建某种树结构并具有更小的子集？

从评论中回答一些问题...条件是唯一的（如果我两次获得相同的标记，我想更新相同的条目），HashSet 是股票 .Net 实现（System.Collections.Generic.HashSet<T> ).

更广泛的代码视图是...

        var state = new RollingList(model.StateDepth); // Tracks last n items and drops older ones. (Basically an array and an index that wraps around
        var tokens = tokeniser.Tokenise(contents); // Iterator
        foreach (var token in tokens) {
            var stateText = StateToString(ref state);
            var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
            // ... update the match as appropriate for the token
        }

Answer 1

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

如果您使用散列集做完全相同的事情，那就没有节省。哈希集针对快速回答问题进行了优化 "is this member in the set?" 而不是 "is there a member that makes this predicate true in the set?" 后者无论是哈希集还是列表都是线性时间。

可能满足您需求的数据结构：

创建一个从文本到状态的字典映射，然后在字典中对文本键进行搜索以获得结果状态。理论上搜索和插入的时间复杂度为 O(1)；实际上，它取决于散列的质量。
制作一个从文本到状态的排序字典映射。再次，搜索文本。已排序的字典使键在平衡树中排序，因此搜索和插入的时间复杂度为 O(log n)。

Answer 2

30k 不是那么多，所以如果状态是唯一的，你可以这样做。词典访问速度更快。

var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ConstainsKey(stateText) ? statesDic[stateText] : default(State);

引用 MSDN：

The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.

您可以找到有关词典的更多信息 here。还要注意字典使用内存 space 来提高性能，您可以对 300k 项进行快速测试，看看我说的是哪种 space：

var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);

快速随机访问集合

Fast random access to a collection

c#

collections

performance

search

random-access