Quickly find duplicate records in a grid with 20,000 records, without any database access

Question

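In my test automation I have no access to the XML or the database. I want to find duplicate records in a specific column of a grid. My grid has 20,000 records. The only problem is that, because we cannot access any database, I have to page through the grid, and changing pages is slow: each page loads 50 records. With 20,000 records this becomes a performance problem.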

Answer 1

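Create a HashMap<Integer, ArrayList<YourObject>>. Each time you read an object, look up its Id in the map and add the object to the ArrayList stored under that Id. Any list that ends up with more than one element holds duplicates.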
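A minimal sketch of this grouping idea, assuming each grid row has already been read into an object with an integer id. `Row`, `groupById` and `findDuplicates` are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupById {
    // Hypothetical stand-in for one grid row (plays the role of YourObject).
    static class Row {
        final int id;
        final String text;

        Row(int id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // Groups rows by id; every row with the same id lands in the same list.
    static Map<Integer, List<Row>> groupById(List<Row> rows) {
        Map<Integer, List<Row>> byId = new HashMap<>();
        for (Row row : rows) {
            byId.computeIfAbsent(row.id, k -> new ArrayList<>()).add(row);
        }
        return byId;
    }

    // Keeps only the groups with more than one row, i.e. the duplicated ids.
    static Map<Integer, List<Row>> findDuplicates(List<Row> rows) {
        Map<Integer, List<Row>> duplicates = new HashMap<>();
        for (Map.Entry<Integer, List<Row>> e : groupById(rows).entrySet()) {
            if (e.getValue().size() > 1) {
                duplicates.put(e.getKey(), e.getValue());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, "a"), new Row(2, "b"), new Row(1, "c"));
        System.out.println("duplicated ids: " + findDuplicates(rows).keySet());
    }
}
```

Since the grid has to be paged through anyway, the map can be filled one page at a time, so each row is only read once.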

Answer 2

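Once you have generated this result you can cache it, so you do not have to regenerate it every time the page is visited. At 2 ms, however, you might not bother.

Here is a timed example: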

import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TimedExample {

    static class MyRecord {
        String text;
        int id;
        double d;

        public MyRecord(String text, int id, double d) {
            this.text = text;
            this.id = id;
            this.d = d;
        }

        public int getId() {
            return id;
        }
    }

    public static void main(String[] args) {
        // Repeat the measurement so that later runs reflect warmed-up code.
        for (int t = 0; t < 100; t++) {
            long start = System.nanoTime();
            Random rand = new Random();
            // Generate 20,000 records with random ids, then keep the first record seen for each id.
            Map<Integer, MyRecord> map = IntStream.range(0, 20000)
                    .mapToObj(i -> new MyRecord("text-" + i, rand.nextInt(i + 1), i))
                    .collect(Collectors.groupingBy(MyRecord::getId,
                            Collectors.reducing(null, (a, b) -> a == null ? b : a)));
            long time = System.nanoTime() - start;
            System.out.printf("Took %.1f ms to generate and collect duplicates%n", time / 1e6);
        }
    }
}

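This test takes 2.0 ms to generate and collate the duplicate records. You could write the same code in Java 7; it would be longer to write, but no slower. It would be even faster if it did not have to generate the records.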
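As a rough sketch of what a Java 7 style version might look like, the stream pipeline can be replaced by a plain loop over a `HashMap`. The class name `Java7Style` and the helper `firstPerId` are hypothetical, and `MyRecord` is repeated only to keep the sketch self-contained. Like the stream version, it keeps the first record seen for each id:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class Java7Style {
    static class MyRecord {
        final String text;
        final int id;
        final double d;

        MyRecord(String text, int id, double d) {
            this.text = text;
            this.id = id;
            this.d = d;
        }

        int getId() {
            return id;
        }
    }

    // Loop-based equivalent of groupingBy + reducing: keep the first record per id.
    static Map<Integer, MyRecord> firstPerId(List<MyRecord> records) {
        Map<Integer, MyRecord> map = new HashMap<Integer, MyRecord>();
        for (MyRecord r : records) {
            if (!map.containsKey(r.getId())) {
                map.put(r.getId(), r);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Random rand = new Random();
        long start = System.nanoTime();
        // Generate the same kind of test data as the stream example.
        List<MyRecord> records = new ArrayList<MyRecord>();
        for (int i = 0; i < 20000; i++) {
            records.add(new MyRecord("text-" + i, rand.nextInt(i + 1), i));
        }
        Map<Integer, MyRecord> map = firstPerId(records);
        long time = System.nanoTime() - start;
        System.out.printf("Took %.1f ms, %d distinct ids%n", time / 1e6, map.size());
    }
}
```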

For comparison, I also ran the stream version in parallel with a concurrent collector:

Map<Integer, MyRecord> map = IntStream.range(0, 20000).parallel()
    .mapToObj(i -> new MyRecord("text-" + i, rand.nextInt(i+1), i))
    .collect(Collectors.groupingByConcurrent(MyRecord::getId,
            Collectors.reducing(null, (a, b) -> a == null ? b : a)));

But this now takes 16 ms. :P

Answer 3

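Here is a basic option. For demonstration purposes I created a list of just over 20,000 records and then checked it for duplicates; it took 29 ms.

Basically, the idea is to scan your values and check, for each one, whether it has been seen before. If it has not, put it in the "unique" bucket that you compare against; otherwise, put it in the duplicates bucket.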

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;


public class FindDuplicates {

    /**
     * @param args
     */
    public static void main(String[] args) {

        List<String> values = new ArrayList<String>();
        Set<String> unique = new HashSet<String>();
        Set<String> duplicates = new HashSet<String>();

        values.add("1");
        values.add("2");
        values.add("3");

        for(int i=0;i<=20000;i++)
        {
            values.add(Integer.toString(i));
        }

        values.add("1");
        values.add("2");
        values.add("4");

        long before = System.currentTimeMillis();

        for(String str : values)
        {
            if(unique.contains(str))
            {
                duplicates.add(str);
            }
            else
            {
                unique.add(str);
            }
        }

        long after = System.currentTimeMillis();

        System.out.println("Processing time: " + (after-before));

        System.out.println("total values: " + values.size());
        System.out.println("total unique: " + unique.size());
        System.out.println("total duplicates: " + duplicates.size());
    }

}
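
The same two-bucket scan can be written a little more compactly, because `Set.add` returns `false` when the element is already present. A minimal sketch, where the class name `FindDuplicatesCompact` and the method `findDuplicates` are illustrative names and not part of the answer above:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FindDuplicatesCompact {
    // Returns every value that occurs more than once in the input.
    static Set<String> findDuplicates(List<String> values) {
        Set<String> unique = new HashSet<String>();
        Set<String> duplicates = new HashSet<String>();
        for (String value : values) {
            // add() returns false if the value was already in the set
            if (!unique.add(value)) {
                duplicates.add(value);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("1", "2", "3", "1", "2", "4");
        System.out.println("duplicates: " + findDuplicates(values));
    }
}
```

This saves one hash lookup per value compared with calling `contains` and then `add`.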

c#

java

search-engine