Finding exact matches while ignoring custom tags

I'm working with an index that holds a mix of documents, some of which may contain custom tags, for example:

"Some long sentence <custom-tag attr="value" /> which ends here"

"Some long sentence <custom-tag attr="value" /> which ends <custom-tag-2 attr="value2" /> here"

"Another long sentence <another-custom-tag attr="value" /> which ends <another-custom-tag attr=value /> here"

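I need to find exact matches regardless of the tag names and attributes. To build such a hypothetical query, the first thing that came to mind was a regular expression, for example: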

"Some long sentence regex(<[^>]*>? which ends here"

would return the first document, and

"Some long sentence regex(<[^>]*>? which ends regex(<[^>]*>? here"

would return the second document.
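To make the intended semantics of that hypothetical query concrete, here is a minimal sketch in plain Java, with no Lucene involved. The `<TAG>` placeholder and the class and method names are assumptions made for illustration, and the placeholder is read as "an optional tag at this position", which is only one possible reading of the pseudo-query above.

```java
import java.util.regex.Pattern;

final class TagPlaceholderQuery {
    // Hypothetical marker used in the query text where a tag may appear.
    private static final String PLACEHOLDER = "<TAG>";

    // Builds a regex in which the literal parts of the query are quoted
    // and every placeholder becomes "an optional <...> tag".
    static Pattern compile(String query) {
        StringBuilder regex = new StringBuilder();
        String[] parts = query.split(Pattern.quote(PLACEHOLDER), -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append("(?:<[^>]*>)?");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        String doc1 = "Some long sentence <custom-tag attr=\"value\" /> which ends here";
        Pattern q1 = compile("Some long sentence <TAG> which ends here");
        System.out.println(q1.matcher(doc1).matches());
    }
}
```

This checks one string at a time, so it only shows what the query is meant to accept. It says nothing about how to run such a query efficiently against an index.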

Is this something I can implement with Lucene 3.x? I'm even considering moving to Lucene 4.8 Beta if there is a good reason to.

Has anyone dealt with something similar? Are there any pitfalls I should be aware of?

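I suppose the simplest approach would be to store the same text, stripped of its tags, in a second field and then run the search against that field. I'd appreciate any comments or suggestions.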
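The tag-stripping step for that second field might look roughly like the sketch below. It is plain Java using only the standard library. The class and method names are hypothetical, and it assumes a tag never contains a `>` inside an attribute value.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Anything from '<' up to the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Replaces each tag with a space, then collapses runs of whitespace
    // so the gap left by a removed tag does not break an exact comparison.
    static String stripTags(String text) {
        String withoutTags = TAG.matcher(text).replaceAll(" ");
        return SPACES.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String doc = "Some long sentence <custom-tag attr=\"value\" /> which ends here";
        System.out.println(stripTags(doc));
    }
}
```

One consequence worth noting: once the tags are gone, the first and second example documents reduce to the same text, so a search on the stripped field alone cannot tell them apart by how many tags they contained.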

Your best bet (in any version) is to create a TokenFilter that recognizes the tags/regex and leaves them out of the token stream.
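The idea behind such a filter can be sketched without Lucene. The code below is not the Lucene TokenFilter API. It is a standalone illustration with hypothetical names that treats each whole tag as a single token and then drops it, which is the behaviour a real filter would need to reproduce. It assumes the tokenizer in front of the filter keeps each tag intact rather than splitting it into words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagSkippingTokens {
    // A token is either a whole tag (<...>) or a run of characters
    // containing neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^\\s<]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("<")) {
                continue; // the "filter" step: tag tokens never reach the output
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "Some long sentence <custom-tag attr=\"value\" /> which ends here";
        System.out.println(tokenize(doc));
    }
}
```

Because the tags never enter the token stream, a phrase search over the remaining tokens sees the same word sequence whether or not a tag was present.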

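By the way: I find it "good" to never store fields (except perhaps an "identifier" field) and instead serialize the object into a binary field. This separates the "index" from the "data", which has some benefits in terms of search speed and IO requirements.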


lucene

lucene.net

lucene.net.linq