EnglishAnalyzer with better stop word filtering?

I'm using Apache Mahout to create TFIDF vectors. I specify the EnglishAnalyzer as part of document tokenization, like this:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzer.class, tokenizedDocumentsPath, configuration); 

This gives me the following vector for a document I've called business.txt. I'm surprised to see useless words like have, on, i, and e.g in there. One of my other documents is loaded with many more.

What's my easiest route to improving the quality of the terms? I know a stop word list can be passed to the EnglishAnalyzer, but the constructor is invoked via reflection, so it seems I can't do that.

Should I write my own Analyzer? I'm somewhat confused about how to compose tokenizers, filters, and so on. Can I re-use the EnglishAnalyzer along with my own filters? Subclassing EnglishAnalyzer doesn't seem to be possible that way.

# document: tfidf-score term
business.txt: 109 comput
business.txt: 110 us
business.txt: 111 innov
business.txt: 111 profit
business.txt: 112 market
business.txt: 114 technolog
business.txt: 117 revolut
business.txt: 119 on
business.txt: 119 platform
business.txt: 119 strategi
business.txt: 120 logo
business.txt: 121 i
business.txt: 121 pirat
business.txt: 123 econom
business.txt: 127 creation
business.txt: 127 have
business.txt: 128 peopl
business.txt: 128 compani
business.txt: 134 idea
business.txt: 139 luxuri
business.txt: 139 synergi
business.txt: 140 disrupt
business.txt: 140 your
business.txt: 141 piraci
business.txt: 145 product
business.txt: 147 busi
business.txt: 168 funnel
business.txt: 176 you
business.txt: 186 custom
business.txt: 197 e.g
business.txt: 301 brand

You can pass a custom stop word set to the EnglishAnalyzer ctor. The stop word list is typically loaded from a plain-text file with one stop word per line. That would look something like this:

String stopFileLocation = "/path/to/my/stopwords.txt";
CharArraySet stopwords = StopwordAnalyzerBase.loadStopwordSet(
        Paths.get(stopFileLocation));
EnglishAnalyzer analyzer = new EnglishAnalyzer(stopwords);
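
For reference, stopwords.txt here is assumed to be nothing more than one term per line; a hypothetical excerpt covering the noise words from your vector might be:

have
on
i
e.g
you
your

Note that the EnglishAnalyzer runs its StopFilter after lowercasing but before the PorterStemFilter, so list the lowercase surface forms rather than the stemmed terms that show up in the vector.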

I don't, off the top of my head, see how you're supposed to pass ctor arguments to the Mahout method you specified; I don't really know Mahout. If you can't, then yes, you can create a custom analyzer by copying EnglishAnalyzer and loading your own stop words in it. Here's an example that loads a custom stop word list from a file, with no stem exclusions (that is, the stem-exclusion stuff has been removed for brevity):

// Imports assume Lucene 6.x; some of these classes moved packages in later releases.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;

public final class EnglishAnalyzerCustomStops extends StopwordAnalyzerBase {
  private static final String STOP_FILE_LOCATION = "/path/to/my/stopwords.txt";

  public EnglishAnalyzerCustomStops() throws IOException {
    super(StopwordAnalyzerBase.loadStopwordSet(Paths.get(STOP_FILE_LOCATION)));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    // stopwords is the CharArraySet loaded in the ctor, inherited from StopwordAnalyzerBase
    result = new StopFilter(result, stopwords);
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    TokenStream result = new StandardFilter(in);
    result = new LowerCaseFilter(result);
    return result;
  }
}
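
Assuming Mahout can instantiate this class reflectively through its no-arg ctor (a guess on my part; note the ctor reads the stop file and can throw IOException, which would surface as a runtime failure during tokenization), you'd then pass it to the same call as in your question:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzerCustomStops.class, tokenizedDocumentsPath, configuration);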