Any way to optimize reading a large (127K-word) English words txt file?

Here is my function:

public void addToList() throws IOException {
    String urlString = "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";
    URL url = new URL(urlString);
    Scanner scannerWords = new Scanner(url.openStream());
    while (scannerWords.hasNextLine()) {
        words.add(scannerWords.nextLine());
    }
}

Execution time: 32.8 seconds.

Is there any way I can optimize it (maybe by reading 10 lines at a time)?

  1. Fetch all the data at once
  2. Apply your filter to extract the expected words

  public static void main(String[] args) throws IOException {
       printWords(new ArrayList<>(150000));
    }

  private static void printWords(List<String> list) throws IOException {
        final long l = System.currentTimeMillis();
        String urlString = "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";
        URL url = new URL(urlString);
        final long l2;
        final long l3;
        Charset encoding = Charset.defaultCharset();
        try (Scanner scanner = new Scanner(url.openStream(), encoding.name())) {
            l2 = System.currentTimeMillis();
            // "\\A" matches the beginning of input, so next() slurps the whole stream
            String content = scanner.useDelimiter("\\A").next();
            list = Arrays.asList(content.split("\n"));
            l3 = System.currentTimeMillis();
            //System.out.println(list);
        }
        final long l4 = System.currentTimeMillis();
        System.out.println(String.format("Total Time: %d",l4-l));
        System.out.println(String.format("Data fetching Time: %d",l2-l));
        System.out.println(String.format("Data collection Time: %d",l3-l2));
    }

Output:

Total Time: 2482
Data fetching Time: 465
Data collection Time: 2017

Here is my attempt. Instead of using a Scanner, I read character by character. This cuts down on the overhead and the extra layers of abstraction that Scanner adds.

        String urlString = "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";
        InputStream stream = new URL(urlString).openStream();
        BufferedInputStream bufferedStream = new BufferedInputStream(stream);
        ArrayList<String> words = new ArrayList<>();
        char[] chars = new char[100];   // longest dictionary word is well under 100 chars
        int index = 0;

        long currentTimeMillis = System.currentTimeMillis();
        while (true) {
            int c = bufferedStream.read();
            if (c == '\n') {
                // end of line: turn the buffered characters into a word
                words.add(new String(chars, 0, index));
                index = 0;
            } else if (c < 0) {
                // end of stream: flush the last word and stop
                words.add(new String(chars, 0, index));
                break;
            } else {
                chars[index++] = (char) c;
            }
        }
        long currentTimeMillis1 = System.currentTimeMillis();

        stream.close();
        
        System.out.println("Time       = " + (currentTimeMillis1-currentTimeMillis) + " ms");
        System.out.println("Word count = " + words.size());
        System.out.println( "First word = "  +  words.get(0));
        System.out.println( "Last word  = " + words.get(words.size()-1));

    }

Output:

run:
Time       = 707 ms
Word count = 127142
First word = aa
Last word  = zyzzyvas
BUILD SUCCESSFUL (total time: 0 seconds)

Well, the obvious thing to do is to download the word list only once, keep a local copy, and read that instead of fetching it over the network every time you run your program.
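A minimal sketch of that caching idea (the `CACHE` path and the `CachedDictionary` class name are mine, not from the question):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class CachedDictionary {
    // Hypothetical cache location; adjust to taste.
    private static final Path CACHE = Paths.get("dictionary.txt");
    private static final String URL_STRING =
        "http://web.stanford.edu/class/archive/cs/cs106l/cs106l.1102/assignments/dictionary.txt";

    public static List<String> loadWords() throws IOException {
        if (!Files.exists(CACHE)) {
            // First run only: download once and keep a local copy.
            try (InputStream in = new URL(URL_STRING).openStream()) {
                Files.copy(in, CACHE);
            }
        }
        // Every later run reads from disk and skips the network entirely.
        return Files.readAllLines(CACHE);
    }
}
```

After the first run, the network cost (the dominant part of the 32.8 seconds) disappears completely.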

You also have a resource leak, because you never close the stream returned by URL.openStream() (your current code would have the same problem if you changed it to read from a file instead). That is easy to fix by adding scannerWords.close(); after the loop, but a better, exception-safe approach is to use try-with-resources.
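Applied to the original addToList(), the try-with-resources version might look like this. I've made it take an InputStream so the same code works with either the URL or a local file, and the class wrapper is just there to make the sketch self-contained:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DictionaryReader {
    private final List<String> words = new ArrayList<>();

    // Same loop as the question's addToList(), but the Scanner (and the
    // stream it wraps) is closed automatically, even if reading throws.
    public void addToList(InputStream in) {
        try (Scanner scannerWords = new Scanner(in)) {
            while (scannerWords.hasNextLine()) {
                words.add(scannerWords.nextLine());
            }
        }
    }

    public List<String> getWords() {
        return words;
    }
}
```

You would call it with new URL(urlString).openStream() for the remote file, or new FileInputStream("dictionary.txt") for a local copy.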

I would ditch Scanner entirely and just use a BufferedReader. Something like:

import java.net.URL;
import java.util.*;
import java.util.stream.*;
import java.io.*;

// ...


private List<String> readLinesFromURL(String url) throws IOException {
    try (BufferedReader br
         = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {         
        return br
            .lines()
            .collect(Collectors.toCollection(ArrayList<String>::new));
    }
}