EnglishAnalyzer with better stop word filtering?

I'm using Apache Mahout to create TFIDF vectors. I specify the EnglishAnalyzer as part of document tokenization, like this:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzer.class, tokenizedDocumentsPath, configuration); 

This gives me the following vector for a document I've called business.txt. I'm surprised to see useless words like have, on, i, and e.g in there. One of my other documents is loaded with many more.

What's my easiest route to improving the quality of the terms? I know a stop word list can be passed to the EnglishAnalyzer, but the constructor is invoked via reflection, so it seems I can't do that.

Should I write my own Analyzer? I'm somewhat confused about how to compose tokenizers, filters, and so on. Can I re-use the EnglishAnalyzer along with my own filters? Subclassing EnglishAnalyzer doesn't seem to be possible that way.

# document: tfidf-score term
business.txt: 109 comput
business.txt: 110 us
business.txt: 111 innov
business.txt: 111 profit
business.txt: 112 market
business.txt: 114 technolog
business.txt: 117 revolut
business.txt: 119 on
business.txt: 119 platform
business.txt: 119 strategi
business.txt: 120 logo
business.txt: 121 i
business.txt: 121 pirat
business.txt: 123 econom
business.txt: 127 creation
business.txt: 127 have
business.txt: 128 peopl
business.txt: 128 compani
business.txt: 134 idea
business.txt: 139 luxuri
business.txt: 139 synergi
business.txt: 140 disrupt
business.txt: 140 your
business.txt: 141 piraci
business.txt: 145 product
business.txt: 147 busi
business.txt: 168 funnel
business.txt: 176 you
business.txt: 186 custom
business.txt: 197 e.g
business.txt: 301 brand

You can pass a custom stop word set to the EnglishAnalyzer ctor. The stop word list is typically loaded from a plain-text file with one stop word per line. That would look something like this:

String stopFileLocation = "/path/to/my/stopwords.txt";
CharArraySet stopwords = StopwordAnalyzerBase.loadStopwordSet(
        Paths.get(stopFileLocation));
EnglishAnalyzer analyzer = new EnglishAnalyzer(stopwords);
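
For reference, stopwords.txt here is assumed to be nothing more than one term per line; a hypothetical excerpt covering the noise words from your vector might be:

have
on
i
e.g
you
your

Note that the EnglishAnalyzer runs its StopFilter after lowercasing but before the PorterStemFilter, so list the lowercase surface forms rather than the stemmed terms that show up in the vector.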

I don't, off the top of my head, see how you're supposed to pass ctor arguments to the Mahout method you specified; I don't really know Mahout. If you can't, then yes, you can create a custom analyzer by copying EnglishAnalyzer and loading your own stop words in it. Here's an example that loads a custom stop word list from a file, with no stem exclusions (that is, the stem-exclusion stuff has been removed for brevity):

// Imports assume Lucene 6.x; some of these classes moved packages in later releases.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;

public final class EnglishAnalyzerCustomStops extends StopwordAnalyzerBase {
  private static final String STOP_FILE_LOCATION = "/path/to/my/stopwords.txt";

  public EnglishAnalyzerCustomStops() throws IOException {
    super(StopwordAnalyzerBase.loadStopwordSet(Paths.get(STOP_FILE_LOCATION)));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    // stopwords is the CharArraySet loaded in the ctor, inherited from StopwordAnalyzerBase
    result = new StopFilter(result, stopwords);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    TokenStream result = new StandardFilter(in);
    result = new LowerCaseFilter(result);
    return result;
  }
}
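
Assuming Mahout can instantiate this class reflectively through its no-arg ctor (a guess on my part; note the ctor reads the stop file and can throw IOException, which would surface as a runtime failure during tokenization), you'd then pass it to the same call as in your question:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzerCustomStops.class, tokenizedDocumentsPath, configuration);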