How do you remove a word completely from an Apache Solr index?

Question

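I am running Apache Solr 6.6.5. When users search for "ETCS" (a special technical term), every document containing the word "etc" is returned as a match. I only want to match documents that actually contain "ETCS". Solr should not even index "etc", since it is such a common word, and the stemmer should never turn "etcs" into "etc" (plural stemming).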

I added "etc" to stopwords.txt:

# Contains words which shouldn't be indexed for fulltext fields, e.g., because
# they're too common. For documentation of the format, see
# http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# (Lines starting with a pound character # are ignored.)
etc

I added "etc" to protwords.txt:

#-----------------------------------------------------------------------
# This file blocks words from being operated on by the stemmer and word delimiter.
&amp;
&lt;
&gt;
&#039;
&quot;
etc

This stops documents containing plain "etc" from matching, but documents containing "etc.", "etc," or similar are still matched.

So I could add even more variants to protwords.txt:

&amp;
&lt;
&gt;
&#039;
&quot;
etc
etc.
etc..
etc...
etc,

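But that list will never be complete. How can I tell the stemmer to treat "etc" as a tokenized word regardless of any non-word characters attached to it?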

My schema.xml: https://gist.github.com/klausi/f59ee47a9b14b915f5bb44bd6cf1c945

Answer 1

1.)

I added "etc" to protwords.txt:

You should add "etcs" to protwords.txt instead, to protect the term "etcs" from being stemmed.

2.)

So I could add even more variants to protwords.txt:

Add every variant of the word you want to remove from the index to stopwords.txt, not to protwords.txt.

3.) Check the field type you are using. Maybe you can tweak it a little there.

//Edit: Adding a link to your schema.xml does not help as long as you do not explain which field you are using.

4.) Do not forget to restart Solr and (if needed) reindex your data.
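
To illustrate points 1 to 3, here is a minimal sketch of a field type, not taken from the linked schema.xml. The name "text_en_sketch" is hypothetical, and the sketch assumes stopwords.txt contains "etc" and protwords.txt contains "etcs". The idea is that a tokenizer which drops punctuation should hand the stop filter a bare "etc" token, so variants such as "etc." or "etc," would not need to be listed one by one.

```xml
<!-- Hypothetical field type; the name "text_en_sketch" is an assumption. -->
<fieldType name="text_en_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on word boundaries and drops punctuation,
         so "etc." and "etc," should both arrive as "etc". -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes stopwords.txt contains the line: etc -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Assumes protwords.txt contains the line: etcs
         Marked tokens are skipped by the stemmer below. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Whether this fits depends on the tokenizer and filter order already used by your field; the Analysis screen in the Solr admin UI shows what each stage does to "etc." and "ETCS". After changing the field type, reindex as described in point 4.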

solr

stemming