Solr stop words do not seem to work: stop words are removed at index time, but at query time they are not removed in proximity search
I am using Solr 8.2.0. I am trying to configure proximity search in my Solr, but it does not seem to remove the stop words from the query.
<fieldType name="psearch" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
  </analyzer>
</fieldType>
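I have listed the stop words in the stopwords.txt file in the directory, and Solr removes them at index time, as the indexed terms show.
I also checked in the Analysis tab, and the stop words are removed there as well.
Here is the field: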
<field name="pSearchField" type="psearch" indexed="true" stored="true" multiValued="false" />
<copyField source="example" dest="pSearchField"/>
When I search with the proximity set to 1, 2 or 3, no results are returned.
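This is a known problem in Solr 5 and later, because it no longer rewrites the position of each token when the stop filter is invoked. The issue is tracked in SOLR-6468, together with a few suggestions for how to work around it.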
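To make the position problem concrete, here is a small sketch of one way the gap can matter. It is a simplified model in plain Java, not Lucene code and not the real phrase scorer; the class name, the sample text "guide to solr" and the single stop word "to" are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Simplified model of token positions; it does not use Lucene and only
// illustrates the effect of position gaps, not the real phrase scorer.
final class PositionGapDemo {

    /**
     * Splits on whitespace, drops stop words, and returns term -> position.
     * keepGaps = true: a removed stop word still consumes a position.
     * keepGaps = false: surviving terms are numbered consecutively.
     */
    static Map<String, Integer> positions(String text, Set<String> stopWords, boolean keepGaps) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int position = -1;
        for (String word : text.toLowerCase().split("\\s+")) {
            boolean isStopWord = stopWords.contains(word);
            if (!isStopWord || keepGaps) {
                position++;
            }
            if (!isStopWord) {
                result.putIfAbsent(word, position);
            }
        }
        return result;
    }

    /** Number of positions separating two terms (second minus first). */
    static int distance(Map<String, Integer> positions, String first, String second) {
        return positions.get(second) - positions.get(first);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("to");

        // The document text contains the stop word; the user's phrase does not.
        Map<String, Integer> indexed = positions("guide to solr", stopWords, true);
        Map<String, Integer> query = positions("guide solr", stopWords, true);

        System.out.println("indexed: " + indexed); // {guide=0, solr=2}
        System.out.println("query:   " + query);   // {guide=0, solr=1}

        // Extra positions the phrase would have to tolerate in order to match.
        int extra = distance(indexed, "guide", "solr") - distance(query, "guide", "solr");
        System.out.println("extra positions needed: " + extra); // 1
    }
}
```

In this model the stop word is gone from the index, but the hole it leaves still separates the neighbouring terms, so the distance seen by a phrase or proximity query differs from what the visible terms suggest.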
The simplest solution is to introduce a mapping char filter factory, but I'm skeptical of it, since it changes characters inside a string (i.e. "to" => "" also affects veto and not just to). This can possibly be handled with multiple PatternReplaceCharFilterFactories.
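The difference between the two approaches can be sketched with plain java.util.regex, which is the regex dialect PatternReplaceCharFilterFactory takes in its pattern attribute. The class name, the sample text and the exact pattern are my own assumptions for illustration; a real configuration would need one pattern (or an alternation) covering every stop word.

```java
import java.util.regex.Pattern;

// Contrasts a plain substring mapping with a word-bounded regex replacement.
// Stand-alone illustration; no Solr or Lucene classes are involved.
final class StopWordCharFilterDemo {

    // Assumed pattern: the whole word "to" plus any whitespace that follows it.
    private static final Pattern WHOLE_WORD_TO = Pattern.compile("\\bto\\b\\s*");

    /** Behaves like a mapping of "to" => "": every occurrence is removed, even inside words. */
    static String mapSubstring(String text) {
        return text.replace("to", "");
    }

    /** Removes "to" only where it stands alone as a word. */
    static String replaceWholeWord(String text) {
        return WHOLE_WORD_TO.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "right to veto";
        System.out.println("[" + mapSubstring(text) + "]");     // [right  ve]
        System.out.println("[" + replaceWholeWord(text) + "]"); // [right veto]
    }
}
```

Because a char filter runs before the tokenizer, text cleaned this way never produces a token for the stop word, so no position gap is created in the first place.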
Another option shown in the ticket thread is to use a custom filter that rewrites the position data of each token:
package filters;

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.util.TokenFilterFactory;

// Factory that lets the filter be referenced from a <filter> element in the schema.
public class RemoveTokenGapsFilterFactory extends TokenFilterFactory {

    public RemoveTokenGapsFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new RemoveTokenGapsFilter(input);
    }
}

// Forces every token to sit exactly one position after the previous one,
// closing the gaps left behind by removed stop words.
final class RemoveTokenGapsFilter extends TokenFilter {

    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    public RemoveTokenGapsFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // No more tokens upstream: signal the end of the stream.
        if (!input.incrementToken()) {
            return false;
        }
        // Overwrite whatever increment the stop filter left on this token.
        posIncrAtt.setPositionIncrement(1);
        return true;
    }
}
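To show what resetting the increment does to the stream, here is a rough stand-alone model that works on bare position increments instead of Lucene attributes. The class and method names are hypothetical, and the increments {1, 2, 1} are an assumed example of three tokens where one stop word was removed before the second token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models a token stream as a list of position increments only.
// Not Lucene code; it mirrors the arithmetic, not the attribute API.
final class IncrementResetDemo {

    /** Turns per-token increments into absolute positions, starting from -1. */
    static List<Integer> toPositions(int[] increments) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            positions.add(position);
        }
        return positions;
    }

    /** Mirrors the filter above: every token gets an increment of exactly 1. */
    static int[] resetIncrements(int[] increments) {
        int[] reset = new int[increments.length];
        Arrays.fill(reset, 1);
        return reset;
    }

    public static void main(String[] args) {
        // Assumed stream: token, [removed stop word], token, token.
        int[] afterStopFilter = {1, 2, 1};

        System.out.println(toPositions(afterStopFilter));                  // [0, 2, 3]
        System.out.println(toPositions(resetIncrements(afterStopFilter))); // [0, 1, 2]
    }
}
```

One thing to keep in mind with this approach: because the increment is overwritten unconditionally, tokens that were deliberately stacked at the same position (increment 0) would be spread out as well, so where the filter sits in the analyzer chain probably matters.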
As far as I know, there is currently no perfect built-in solution to this problem.