Most efficient way to remove duplicates from a List

Question

Let's say I have a List that contains duplicate values and I want to remove the duplicates.

// needs: using System.Collections.Generic; using System.Linq;
List<int> myList = new List<int>(Enumerable.Range(0, 10000));

// adding a few duplicates here
myList.Add(1);
myList.Add(2);
myList.Add(3);

I have found 3 approaches to solve this:

List<int> result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> result2 = myList.Distinct().ToList(); //4700 ticks
List<int> result3 = myList.GroupBy(x => x).Select(grp => grp.First()).ToList(); //18800 ticks
//referring to pinturic's comment:
List<int> result4 = new SortedSet<int>(myList).ToList(); //18000 ticks

In most answers here on SO, the Distinct method is shown as the "correct one", yet the HashSet approach is always faster in my measurements!
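A rough sketch of how tick counts like the ones above could be collected with Stopwatch. The DedupTiming class and its method names are hypothetical, introduced only for illustration. The numbers depend on the machine, the runtime and JIT warm-up, so no output is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical helper class, not part of the original question.
public static class DedupTiming
{
    // Builds the same test list as in the question: 0..9999 plus three duplicates.
    public static List<int> BuildList()
    {
        var list = new List<int>(Enumerable.Range(0, 10000));
        list.Add(1);
        list.Add(2);
        list.Add(3);
        return list;
    }

    // Runs the action once as an unmeasured warm-up, then times a second run.
    public static long MeasureTicks(Action action)
    {
        action();
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedTicks;
    }

    // Prints one tick count per approach; a single run is only a rough indication.
    public static void PrintComparison()
    {
        List<int> myList = BuildList();
        Console.WriteLine("HashSet:   " + MeasureTicks(() => new HashSet<int>(myList).ToList()));
        Console.WriteLine("Distinct:  " + MeasureTicks(() => myList.Distinct().ToList()));
        Console.WriteLine("GroupBy:   " + MeasureTicks(() => myList.GroupBy(x => x).Select(g => g.First()).ToList()));
        Console.WriteLine("SortedSet: " + MeasureTicks(() => new SortedSet<int>(myList).ToList()));
    }
}
```

Calling DedupTiming.PrintComparison() prints the four measurements. For anything beyond a quick check, averaging many runs would give more trustworthy numbers.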

My question: is there anything I have to be aware of when I use the HashSet approach, and is there another, more efficient way?

Answer 1

There is a big difference between these two approaches:

List<int> Result1 = new HashSet<int>(myList).ToList(); //3700 ticks
List<int> Result2 = myList.Distinct().ToList(); //4700 ticks

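The first one can (and probably will) change the order of the elements of the returned List<>: the elements of Result1 won't be in the same order as those of myList. The second one maintains the original ordering.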
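A minimal sketch of that difference. The OrderDemo class and its method names are made up for illustration. Since HashSet<T> gives no ordering guarantee, the sketch compares the HashSet result only as a set, while the Distinct() result can be compared element by element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, not part of the original answer.
public static class OrderDemo
{
    // Keeps the first occurrence of each value, in the order it was encountered.
    public static List<int> ViaDistinct(List<int> source)
    {
        return source.Distinct().ToList();
    }

    // Same elements, but the enumeration order of HashSet<T> is not guaranteed.
    public static List<int> ViaHashSet(List<int> source)
    {
        return new HashSet<int>(source).ToList();
    }

    // Order-insensitive comparison: true when both lists hold the same set of values.
    public static bool SameElements(List<int> a, List<int> b)
    {
        return new HashSet<int>(a).SetEquals(b);
    }
}
```

For the input { 3, 1, 3, 2, 1 }, ViaDistinct returns 3, 1, 2. ViaHashSet returns the same three values, but code relying on their order would be depending on an implementation detail.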

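There is probably no faster way than the first one.

There is probably no "more correct" way (for a certain definition of "correct" based on ordering) than the second one.

(The third one is similar to the second one, only slower.)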

Just out of curiosity, Distinct() is:

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,712
public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source) {
    if (source == null) throw Error.ArgumentNull("source");
    return DistinctIterator<TSource>(source, null);
}

// Reference source http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,722
static IEnumerable<TSource> DistinctIterator<TSource>(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer) {
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource element in source)
        if (set.Add(element)) yield return element;
}

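So in the end Distinct() simply uses an internal implementation of a HashSet<> (called Set<>) to check the uniqueness of items. Each element is yielded the first time it is added to the set, which is why the original order is kept.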
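The same pattern can be sketched with the public HashSet<T> instead of the internal Set<T>. This is only an approximation of the reference source above, not the framework's actual code, and the DistinctSketch class and DistinctPreservingOrder method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical re-creation of the DistinctIterator pattern using HashSet<T>.
public static class DistinctSketch
{
    // Validates the argument eagerly, then hands off to the lazy iterator,
    // mirroring the two-method split in the reference source.
    public static IEnumerable<T> DistinctPreservingOrder<T>(
        IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterator(source, comparer);
    }

    private static IEnumerable<T> Iterator<T>(
        IEnumerable<T> source, IEqualityComparer<T> comparer)
    {
        // A null comparer makes HashSet<T> fall back to EqualityComparer<T>.Default.
        var seen = new HashSet<T>(comparer);
        foreach (T element in source)
        {
            // Add returns false for an element that is already in the set,
            // so only the first occurrence of each value is yielded.
            if (seen.Add(element)) yield return element;
        }
    }
}
```

Calling DistinctSketch.DistinctPreservingOrder(myList) and then ToList() on the result gives the HashSet-based uniqueness check while keeping the elements in their original order.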

For completeness, I'll add a link to the question "Does C# Distinct() method keep original ordering of sequence intact?"
