Elasticsearch tokenizer to keep (and concatenate) "and"
I'm trying to build an Elasticsearch filter, analyzer, and tokenizer that can normalize searches such as:
"henry&william book" -> "henrywilliam book"
"henry & william book" -> "henrywilliam book"
"henry and william book" -> "henrywilliam book"
"henry william book" -> "henry william book"
In other words, I want to normalize my "and" and "&" queries while concatenating the words on either side of them.
I was thinking of building a tokenizer that splits "henry & william book" into the tokens ["henry & william", "book"], and then a filter that performs the following replacements:
" & " -> ""
" and " -> ""
"&" -> ""
However, this feels hacky. Is there a better way?
The reason I can't do this entirely in the analyzer/filter stage is that it runs too late. In my attempts, Elasticsearch had already split "henry & william" into ["henry", "william"] before my analyzer/filter ever ran.
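For illustration, the premature splitting can be reproduced with the built-in standard analyzer (a minimal sketch, not taken from the original attempts):
POST _analyze
{
  "analyzer": "standard",
  "text": ["henry & william"]
}
This returns the two tokens henry and william; the & is already gone by the time any token filter runs.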
You can cleverly combine two character filters that kick in before the tokenizer. The first character filter maps and to &, and the second one removes the & and glues the two adjacent tokens together. This combination also allows you to introduce other substitutions, such as | and or.
PUT test
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and": {
          "type": "mapping",
          "mappings": [
            "and => &"
          ]
        },
        "&": {
          "type": "pattern_replace",
          "pattern": """(\w+)(\s*&\s*)(\w+)""",
          "replacement": "$1$3"
        }
      },
      "analyzer": {
        "my-analyzer": {
          "type": "custom",
          "char_filter": [
            "and", "&"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
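If you also want to handle or and |, the mapping list of the first character filter can be extended along these lines (a sketch only; the extra mappings are assumptions, and note that the mapping character filter matches these strings anywhere in the character stream, including inside other words, which the second answer's \b boundaries guard against):
"and": {
  "type": "mapping",
  "mappings": [
    "and => &",
    "or => &",
    "| => &"
  ]
}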
This produces the following results:
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry&william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry & william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry and william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henrywilliam book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
POST test/_analyze
{
  "analyzer": "my-analyzer",
  "text": [
    "henry william book"
  ]
}
Results =>
{
  "tokens" : [
    {
      "token" : "henry william book",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
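To use this analyzer on real documents, it would be attached to a field in the index mapping, for example (a minimal sketch; the field name title is an assumption):
PUT test/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my-analyzer"
    }
  }
}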
You only need a single character filter and a bit of regex knowledge. Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_and": {
          "type": "pattern_replace",
          "pattern": """\s*(&|\band\b)\s*""",
          "replacement": "",
          "description": "Removes ands and ampersands"
        }
      },
      "analyzer": {
        "book-analyzer": {
          "type": "custom",
          "char_filter": [
            "remove_and"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}
Explanation:
- \s* optional whitespace around the expression
- \b word boundaries around 'and', so that it does not trigger inside words like candy
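A quick way to verify the behavior, assuming the settings above are applied to an index named test2 (the index name is an assumption):
POST test2/_analyze
{
  "analyzer": "book-analyzer",
  "text": ["henry and william book"]
}
This should return the single keyword token henrywilliam book, while a word like candy remains untouched thanks to the \b boundaries.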