将一种数据结构与另一种数据结构进行比较，导致运行时间超过 50 分钟

Question

我正在编写读取文本文件（每行一条推文）的代码，然后遍历每条推文，将其与英语单词列表进行比较，以查看该单词是否拼写错误。

所以英语单词列表也是从文本文件中读入的，然后存储在一个列表中。当我运行单独编写代码时，它的运行时间不到一秒钟。当我运行将每个单词存储在 1,000,000 条推文的推文文件中（不检查拼写）的代码时，它会在大约 20-30 秒内将每个单词及其频率存储在 HashMap<String, Integer> 中。

但是当我添加检查单词是否拼写正确的行时，它会导致荒谬的运行时间增加，以至于我几乎可以在看完之前看完一部电影运行宁.

调用 isSpelledCorrectly(X) 的简单方面（仅调用 list.contains(x)，其最坏情况运行时间为 O(n)），但它似乎相当令人困惑它会导致代码从 30 秒运行时间变为 50 分钟运行时间?

代码：

拼写：

static List<String> spellCheck = new ArrayList<String>();

public AssignTwo() throws IOException{
    spellCheck = initCorrectSpelling("C:\Users\Gregs\InfoRetrieval\src\english-words");
}

public static List<String> initCorrectSpelling(String filename) throws IOException { //store correct spelling of words in list
    Scanner scanner = new Scanner(new FileInputStream(filename));
    try{
        while(scanner.hasNextLine()){
            String next = scanner.nextLine();
            spellCheck.add(next);
        }
    }
    finally{
        scanner.close();
    }
    return spellCheck;
}

public static boolean isSpelledCorrectly(String word){  //check if any given word is spelled correctly by seeing if it is 
    boolean output = false;                             //contained within the spellCheck list
    if(spellCheck.contains(word)) output = true;                    
    return output;
}

存储推文的代码：

public static HashMap<String, Integer> misSpell;
public AssignOne() throws IOException {         //read in file from path, test functions
    index("C:\Users\Gregs\InfoRetrieval\src\tweets");
}

public static void index(String filename)  throws IOException {
    misSpell = new HashMap<String, Integer>();
    Scanner scanner = new Scanner(new FileInputStream(filename));
    try{
        while(scanner.hasNextLine()){
            String line = scanner.nextLine();       
            String[] lineArr = line.split(" ");
            for(int i=3; i<lineArr.length; i++){
                int count=1;
                lineArr[i] = lineArr[i].replaceAll("[^a-zA-Z0-9]", "");
                //if(!AssignTwo.isSpelledCorrectly(lineArr[i].toLowerCase())){  //with this line commented out, runtime <30sec, with line >50mins
                    if(misSpell.containsKey(lineArr[i].toLowerCase())){             
                        count = 1 + misSpell.get(lineArr[i].toLowerCase());             
                    }
                    misSpell.put(lineArr[i].toLowerCase(), count);
                //}
            }
        }
    } 
    finally{
        scanner.close();
    }
}

关于在何处改进代码或如何使比较更有效的任何建议？是否有更快的正确拼写单词数据结构？

Answer 1

List.contains() 是 O(N)，N 是字典中的单词数。

使用哈希集，其中 contains() 是 O(1)。

使用缓冲 reader 也会加快速度。并且避免在每个单词上调用 toLowerCase() 三次。

将一种数据结构与另一种数据结构进行比较，导致运行时间超过 50 分钟

Comparing one data structure against another resulting in run time of over 50 mins

java

performance

hashmap

spelling

将一种数据结构与另一种数据结构进行比较，导致 运行 时间超过 50 分钟

Comparing one data structure against another resulting in run time of over 50 mins

java

performance

hashmap

spelling

将一种数据结构与另一种数据结构进行比较，导致运行时间超过 50 分钟