Java

Question

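I wrote a method, processTrainDirectory, that is supposed to import and process all text files in a given directory. Processing each file individually takes roughly the same time (about 90 ms), but when I use the batch-import method on a directory, the time per file increases steadily (from 90 ms to more than 4000 ms after 300 files). The batch-import method is as follows: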

public void processTrainDirectory(String folderPath, Category category) {
    File folder = new File(folderPath);
    File[] listOfFiles = folder.listFiles();
    if (listOfFiles != null) {
        for (File file : listOfFiles) {
            if (file.isFile()) {
                processTrainText(file.getPath(), category);
            }
        }
    }
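    else {
        // listFiles() returns null if the path is not a directory or an I/O error occurs
        System.out.println("Could not list files in: " + folderPath);
    }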

}

As I said, processTrainText is called for every text file in the directory. When it is used inside processTrainDirectory, this method takes progressively longer. The method processTrainText is as follows:

public void processTrainText(String path, Category category) {
    trainTextAmount++;
    Map<String, Integer> text = prepareText(path);
    update(text, category);
}

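I called processTrainText manually 200 times on 200 different texts, and that took 200 * 90 ms in total. But when I have a directory with 200 files and use processTrainDirectory, the calls take 90-92-96-104 ... 3897-3940-4002 ms, which is far longer.

The problem persists when processTrainText is called a second time; it does not reset. Do you know why this happens or what causes it, and how I can fix it?

Any help is greatly appreciated!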

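Edit: Someone asked what the other called methods do, so here are all the methods used from my class BayesianClassifier; all other methods are removed for clarity. Below that you can find the class Category: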

public class BayesianClassifier {
    private Map<String, Integer> vocabulary;
    private List<Category> categories;
    private int trainTextAmount;
    private int testTextAmount;
    private GUI gui;


    public Map<String, Integer> prepareText(String path) {
        String text = readText(path);
        String normalizedText = normalizeText(text);
        String[] tokenizedText = tokenizeText(normalizedText);
        return countText(tokenizedText);
    }

    public String readText(String path) {
        BufferedReader br;
        String result = "";
        try {

            br = new BufferedReader(new FileReader(path));
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }
            result = sb.toString();
            br.close();
        } catch (IOException e) {
            e.printStackTrace();

        }

        return result;
    }


    public Map<String, Integer> countText(String[] words){
        Map<String, Integer> result = new HashMap<>();
        for(int i=0; i < words.length; i++){
            if (!result.containsKey(words[i])){
                result.put(words[i], 1);
            }
            else {
                result.put(words[i], result.get(words[i]) + 1);
            }
        }
        return result;
    }

    public void processTrainText(String path, Category category){
        trainTextAmount++;
        Map<String, Integer> text = prepareText(path);
        update(text, category);   
    }

    public void update(Map<String, Integer> text, Category category) {
        category.addText();
        for (Map.Entry<String, Integer> entry : text.entrySet()){
            if(!vocabulary.containsKey(entry.getKey())){
                vocabulary.put(entry.getKey(), entry.getValue());
                category.updateFrequency(entry);
                category.updateProbability(entry);
                category.updatePrior();
            }

            else {
                vocabulary.put(entry.getKey(), vocabulary.get(entry.getKey()) + entry.getValue());
                category.updateFrequency(entry);
                category.updateProbability(entry);
                category.updatePrior();
            }

            for(Category cat : categories){
                if (!cat.equals(category)){
                    cat.addWord(entry.getKey());
                    cat.updatePrior();
                }
            }
        }
    }

    public void processTrainDirectory(String folderPath, Category category) {
        File folder = new File(folderPath);
        File[] listOfFiles = folder.listFiles();
        if (listOfFiles != null) {
            for (File file : listOfFiles) {
                if (file.isFile()) {
                    processTrainText(file.getPath(), category);
                }
            }
        }
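        else {
            // listFiles() returns null if the path is not a directory or an I/O error occurs
            System.out.println("Could not list files in: " + folderPath);
        }

    }
}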

Here is my Category class (all unneeded methods are removed for clarity):

public class Category {
    private String categoryName;
    private double prior;
    private Map<String, Integer> frequencies;
    private Map<String, Double> probabilities;
    private int textAmount;
    private BayesianClassifier bc;

    public Category(String categoryName, BayesianClassifier bc){
        this.categoryName = categoryName;
        this.bc = bc;
        this.frequencies = new HashMap<>();
        this.probabilities = new HashMap<>();
        this.textAmount = 0;
        this.prior = 0.00;
    }

    public void addWord(String word){
        this.frequencies.put(word, 0);
        this.probabilities.put(word, 0.0);
    }

    public void updateFrequency(Map.Entry<String, Integer> entry){
        if(!this.frequencies.containsKey(entry.getKey())){
            this.frequencies.put(entry.getKey(), entry.getValue());
        }
        else {
            this.frequencies.put(entry.getKey(), this.frequencies.get(entry.getKey()) + entry.getValue());
        }
    }

    public void updateProbability(Map.Entry<String, Integer> entry){
        double chance = ((double) this.frequencies.get(entry.getKey()) + 1) / (sumFrequencies() + bc.getVocabulary().size());
        this.probabilities.put(entry.getKey(), chance);
    }

    public Integer sumFrequencies(){
        Integer sum = 0;
        for (Integer integer : this.frequencies.values()) {
            sum = sum + integer;
        }
        return sum;
    }  
}

Answer 1

What does this method do?

update(text, category);

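If it does what I suspect, it is probably your bottleneck. If you call it on its own, without any accumulated context, and it only updates some general data structure, it will always take about the same time. If it updates something that holds data from your previous iterations, I'm fairly sure it will take more and more time. In that case, check the complexity of the update() method and reduce the bottleneck.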

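Update: your method updateProbability works through all the data you have collected so far whenever it computes the sum of the frequencies, so the more files you process, the longer it takes. This is your bottleneck. There is no need to recompute the sum every time; store it and update it whenever something changes, to minimize the amount of computation.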
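To make the growth visible, here is a minimal sketch that counts how many map values a re-summing loop visits per file. It is a simulation, not the asker's code: the class name ResumCostDemo, the visited counter, and the assumption that every file contributes 100 new words are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ResumCostDemo {
    // Hypothetical counter: number of map values visited by the naive sum.
    static long visited = 0;

    // Same shape as sumFrequencies() in the question: walks every value on every call.
    static int naiveSum(Map<String, Integer> frequencies) {
        int sum = 0;
        for (int value : frequencies.values()) {
            sum += value;
            visited++;
        }
        return sum;
    }

    // Simulates one file that adds wordsPerFile new words (an assumption for this demo)
    // and returns how many values were visited while processing it.
    static long processFile(Map<String, Integer> frequencies, int fileIndex, int wordsPerFile) {
        long before = visited;
        for (int w = 0; w < wordsPerFile; w++) {
            frequencies.merge("f" + fileIndex + "w" + w, 1, Integer::sum);
            naiveSum(frequencies); // updateProbability re-sums once per entry
        }
        return visited - before;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (int file = 1; file <= 5; file++) {
            long cost = processFile(frequencies, file, 100);
            System.out.println("file " + file + ": " + cost + " values visited");
        }
    }
}
```

Under these assumptions the first file visits 5050 values, the second 15050, and every later file 10000 more than the one before it, which matches the steadily rising per-file times in the question.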

Answer 2

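It looks like the time per file grows linearly, so the total time grows quadratically. This means that for each file you are processing the data of all previous files. And indeed you are: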

updateProbability calls sumFrequencies, which runs through the whole of frequencies, and that map grows with every file. That's the culprit. Just create a field int sumFrequencies and update it in updateFrequency.
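A minimal sketch of that change, assuming a simplified stand-in for the Category class. The class name CategoryCounts, the (word, count) parameters, and the probability method that takes the vocabulary size as an argument are hypothetical simplifications, not the asker's actual signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the question's Category class; only the counting parts are shown.
public class CategoryCounts {
    private final Map<String, Integer> frequencies = new HashMap<>();
    // Running total of all values in frequencies, maintained on every update.
    private int sumFrequencies = 0;

    public void updateFrequency(String word, int count) {
        frequencies.merge(word, count, Integer::sum);
        sumFrequencies += count; // keep the cached total in sync with the map
    }

    // Constant time: returns the cached total instead of looping over the map.
    public int sumFrequencies() {
        return sumFrequencies;
    }

    // Same formula as updateProbability in the question:
    // (frequency + 1) / (sum of frequencies + vocabulary size).
    public double probability(String word, int vocabularySize) {
        int frequency = frequencies.getOrDefault(word, 0);
        return (frequency + 1.0) / (sumFrequencies + vocabularySize);
    }

    public static void main(String[] args) {
        CategoryCounts c = new CategoryCounts();
        c.updateFrequency("spam", 3);
        c.updateFrequency("ham", 1);
        c.updateFrequency("spam", 2);
        System.out.println(c.sumFrequencies());       // 6
        System.out.println(c.probability("spam", 2)); // (5 + 1) / (6 + 2) = 0.75
    }
}
```

One thing to watch when applying this to the real class: addWord in the question calls frequencies.put(word, 0). If that can ever overwrite an existing non-zero count, the cached total has to be adjusted there as well, or the two will drift apart.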

As a further improvement, consider using Guava's Multiset, which does the counting in a simpler and more efficient way (no autoboxing). After fixing your code, consider having it reviewed on CR (Code Review); it has a lot of small issues.
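If adding Guava as a dependency is not an option, countText can be simplified in a similar way with Map.merge from the standard library. Note that this is a different technique from Multiset: it still boxes the Integer values, so it only addresses the "simpler" part of the suggestion. The class name WordCount is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Counts occurrences of each word; merge() inserts 1 for a new key
    // or adds 1 to the existing count.
    public static Map<String, Integer> countText(String[] words) {
        Map<String, Integer> result = new HashMap<>();
        for (String word : words) {
            result.merge(word, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countText(new String[] {"a", "b", "a"});
        System.out.println(counts.get("a")); // 2
        System.out.println(counts.get("b")); // 1
    }
}
```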

Java - Method for batch processing text files is much slower than the same action performed individually the same number of times

methods

text

batch-processing

text-classification