How to create a custom synonym token filter with elasticsearch-dsl in Python?
I'm trying to build a synonym token filter with elasticsearch-dsl in Python so that, for example, a search for "tiny" or "little" will also return articles containing "small".
Here is my code:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer, connections, token_filter

# Connect to local host server
connections.create_connection(hosts=['127.0.0.1'])

spelling_tokenfilter = token_filter(
    'my_tokenfilter',  # Name for the filter
    'synonym',         # Synonym filter type
    synonyms_path="analysis/wn_s.pl"
)

# Create elasticsearch object
es = Elasticsearch()

text_analyzer = analyzer('my_tokenfilter',
    type='custom',
    tokenizer='standard',
    filter=['lowercase', 'stop', spelling_tokenfilter])
I created a folder named 'analysis' under es-7.6.2/config, downloaded the WordNet prolog database, and copied 'wn_s.pl' into it. But when I run the program, I get this error:
Traceback (most recent call last):
File "index.py", line 161, in <module>
main()
File "index.py", line 156, in main
buildIndex()
File "index.py", line 74, in buildIndex
covid_index.create()
File "C:\Anaconda\lib\site-packages\elasticsearch_dsl\index.py", line 259, in create
return self._get_connection(using).indices.create(index=self._name, body=self.to_dict(), **kwargs)
File "C:\Anaconda\lib\site-packages\elasticsearch\client\utils.py", line 92, in _wrapped
return func(*args, params=params, headers=headers, **kwargs)
File "C:\Anaconda\lib\site-packages\elasticsearch\client\indices.py", line 104, in create
"PUT", _make_path(index), params=params, headers=headers, body=body
File "C:\Anaconda\lib\site-packages\elasticsearch\transport.py", line 362, in perform_request
timeout=timeout,
File "C:\Anaconda\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 248, in perform_request
self._raise_error(response.status, raw_data)
File "C:\Anaconda\lib\site-packages\elasticsearch\connection\base.py", line 244, in _raise_error
status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'failed to build synonyms')
Does anyone know how to fix this?
Thanks!
This happens because you defined the lowercase and stop token filters before the synonym filter (docs):
Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries.
First, let's get more details about the error by capturing the exception:
>>> text_analyzer = analyzer('my_tokenfilter',
... type='custom',
... tokenizer='standard',
... filter=[
... 'lowercase', 'stop',
... spelling_tokenfilter
... ])
>>>
>>> try:
... text_analyzer.simulate('blah blah')
... except Exception as e:
... ex = e
...
>>> ex
RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms'}], 'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms', 'caused_by': {'type': 'parse_exception', 'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}, 'status': 400})
This part is especially interesting:
'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}
This suggests that Elasticsearch managed to find the file but failed to parse it: the stop filter drops "of" from the WordNet entry "course of action", leaving a token with a position increment of 2, which the synonym parser rejects.
Finally, if you remove those two token filters, the error goes away:
text_analyzer = analyzer('my_tokenfilter',
    type='custom',
    tokenizer='standard',
    filter=[
        #'lowercase', 'stop',
        spelling_tokenfilter
    ])
...
>>> text_analyzer.simulate("blah")
{'tokens': [{'token': 'blah', 'start_offset': 0, 'end_offset...}
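If it helps, here is a minimal sketch of how the working analyzer could then be attached to a field and used when creating the index; the Article document class and the 'covid' index name are placeholders for illustration:
from elasticsearch_dsl import Document, Index, Text

# 'Article' and the 'covid' index name are made up for this example
class Article(Document):
    body = Text(analyzer=text_analyzer)

covid_index = Index('covid')
covid_index.document(Article)   # register the document class with the index
covid_index.create()            # creates the index with the custom analyzer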
The docs suggest using the multiplexer token filter in case you need to combine these.
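As a rough, unverified sketch of what that could look like with plain index settings (the filter names, analyzer name, and index name below are all made up): the stop filter goes in one multiplexer branch and the synonym filter in the other, so the synonym file is parsed without stop words being stripped from multi-word entries.
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Sketch only: 'my_synonyms', 'my_multiplexer', 'my_text_analyzer' and
# 'my-index' are hypothetical names, not part of the original question.
body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/wn_s.pl"
                },
                "my_multiplexer": {
                    "type": "multiplexer",
                    # each string is one branch; 'stop' never precedes the synonym filter
                    "filters": ["lowercase, stop", "lowercase, my_synonyms"]
                }
            },
            "analyzer": {
                "my_text_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["my_multiplexer"]
                }
            }
        }
    }
}

es.indices.create(index="my-index", body=body)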
Hope this helps!