HashSet<T>.removeAll 方法出奇地慢

The HashSet<T>.removeAll method is surprisingly slow

Jon Skeet 最近在他的博客上提出了一个有趣的编程话题:"There's a hole in my abstraction, dear Liza, dear Liza"(强调已添加):

I have a set – a HashSet, in fact. I want to remove some items from it… and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds – and indeed is – extremely easy to code. After all, we’ve got Set<T>.removeAll to help us, right?

We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn’t the world most accurate stopwatch but is more than adequate in this case, as you’ll see. Here’s the code:

import java.util.*;
public class Test 
{ 
    public static void main(String[] args) 
    { 
       int sourceSize = Integer.parseInt(args[0]); 
       int removalsSize = Integer.parseInt(args[1]); 
        
       Set<Integer> source = new HashSet<Integer>(); 
       Collection<Integer> removals = new ArrayList<Integer>(); 
        
       for (int i = 0; i < sourceSize; i++) 
       { 
           source.add(i); 
       } 
       for (int i = 1; i <= removalsSize; i++) 
       { 
           removals.add(-i); 
       } 
        
       long start = System.currentTimeMillis(); 
       source.removeAll(removals); 
       long end = System.currentTimeMillis(); 
       System.out.println("Time taken: " + (end - start) + "ms"); 
    }
}

Let’s start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:UsersJonTest>java Test 100 100
Time taken: 1ms

Okay, so we hadn’t expected it to be slow… clearly we can ramp things up a bit. How about a source of one million items and 300,000 items to remove?

c:UsersJonTest>java Test 1000000 300000
Time taken: 38ms

Hmm. That still seems pretty speedy. Now I feel I’ve been a little bit cruel, asking it to do all that removing. Let’s make it a bit easier – 300,000 source items and 300,000 removals:

c:UsersJonTest>java Test 300000 300000
Time taken: 178131ms

Excuse me? Nearly three minutes? Yikes! Surely it ought to be easier to remove items from a smaller collection than the one we managed in 38ms?

有人可以解释为什么会这样吗?为什么 HashSet<T>.removeAll 方法这么慢?

该行为(在某种程度上)记录在 javadoc:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method.

这在实践中意味着什么,当您调用 source.removeAll(removals); 时:

  • 如果removals集合的大小小于source,则调用HashSetremove方法,速度很快。

  • 如果 removals 集合的大小等于或大于 source,则调用 removals.contains,这对于 ArrayList 来说很慢。

快速修复:

Collection<Integer> removals = new HashSet<Integer>();

请注意,an open bug 与您描述的非常相似。底线似乎是它可能是一个糟糕的选择但不能更改,因为它记录在 javadoc 中。


作为参考,这是removeAll的代码(在Java 8 - 没有检查其他版本):

public boolean removeAll(Collection<?> c) {
    Objects.requireNonNull(c);
    boolean modified = false;

    if (size() > c.size()) {
        for (Iterator<?> i = c.iterator(); i.hasNext(); )
            modified |= remove(i.next());
    } else {
        for (Iterator<?> i = iterator(); i.hasNext(); ) {
            if (c.contains(i.next())) {
                i.remove();
                modified = true;
            }
        }
    }
    return modified;
}