Exception while using Apache Lucene for stop words removal

I am using the following code to remove stop words from input text. The call to tokenStream.incrementToken() throws the following exception at runtime.

java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.


public static String removeStopWords(String textFile) throws Exception {
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
        TokenStream tokenStream = new StandardTokenizer();
        tokenStream = new StopFilter(tokenStream, stopWords);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            String term = charTermAttribute.toString();
            sb.append(term + " ");
        }
        return sb.toString();
}

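Instantiate your TokenStream as shown below, so that the tokenizer actually receives the input text. You also need to call tokenStream.reset() before the first incrementToken() call: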

TokenStream tokenStream = new StandardAnalyzer().tokenStream("field",new StringReader(textFile));
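Putting this together with the consuming workflow described in the TokenStream Javadocs (reset(), incrementToken() in a loop, end(), close()), the whole method could look roughly like the sketch below. The class name StopWordRemover and the field name "field" are placeholders, not part of the original code. Passing EnglishAnalyzer.getDefaultStopSet() to the StandardAnalyzer constructor is an assumption made so that stop words are removed on Lucene versions where the no-argument constructor uses an empty stop set.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical wrapper class, used only for illustration.
public class StopWordRemover {

    public static String removeStopWords(String text) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the TokenStream first, then the Analyzer.
        try (Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.getDefaultStopSet());
             TokenStream tokenStream = analyzer.tokenStream("field", new StringReader(text))) {
            CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();                 // required before the first incrementToken()
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(' ');              // single space between terms, none trailing
                }
                sb.append(term.toString());
            }
            tokenStream.end();                   // signals end of stream before close()
        }
        return sb.toString();
    }
}
```

Note that StandardAnalyzer also lowercases the tokens, so the returned text is lowercased as well as stripped of stop words.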