How to custom a synonym token filter with ElasticSearch-dsl in python?

I'm trying to build a synonym token filter with ElasticSearch-dsl in Python so that, for example, when I search for "tiny" or "little", it also returns articles containing "small". Here is my code:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer, connections, token_filter

# Connect to local host server
connections.create_connection(hosts=['127.0.0.1'])

spelling_tokenfilter = token_filter(
    'my_tokenfilter', # Name for the filter
    'synonym', # Synonym filter type
    synonyms_path="analysis/wn_s.pl"
    )

# Create elasticsearch object
es = Elasticsearch()

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=['lowercase', 'stop', spelling_tokenfilter])
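
The code that actually creates the index isn't shown above; judging from the traceback below, the analyzer is presumably attached to an index roughly along these lines (a reconstructed sketch; covid_index, Article and the body field are guessed names):

from elasticsearch_dsl import Document, Index, Text

# Hypothetical index/document mirroring the traceback; adjust names as needed
covid_index = Index('covid')

@covid_index.document
class Article(Document):
    # the custom analyzer defined above is applied to this text field
    body = Text(analyzer=text_analyzer)

def buildIndex():
    covid_index.create()  # this is the call that raises the RequestError below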

I created a folder named 'analysis' in es-7.6.2/config, downloaded the WordNet prolog database, and copied 'wn_s.pl' into it. But when I run the program, I get this error:

Traceback (most recent call last):
  File "index.py", line 161, in <module>
    main()
  File "index.py", line 156, in main
    buildIndex()
  File "index.py", line 74, in buildIndex
    covid_index.create()
  File "C:\Anaconda\lib\site-packages\elasticsearch_dsl\index.py", line 259, in create
    return self._get_connection(using).indices.create(index=self._name, body=self.to_dict(), **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\indices.py", line 104, in create
    "PUT", _make_path(index), params=params, headers=headers, body=body
  File "C:\Anaconda\lib\site-packages\elasticsearch\transport.py", line 362, in perform_request
    timeout=timeout,
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 248, in perform_request
    self._raise_error(response.status, raw_data)
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\base.py", line 244, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'failed to build synonyms')

Does anyone know how to fix this? Thanks!

This happens because you defined the lowercase and stop token filters before the synonym filter (docs):

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries.

First, let's try to get more details about the error by catching the exception:

>>> text_analyzer = analyzer('my_tokenfilter',
...                          type='custom',
...                          tokenizer='standard',
...                          filter=[
...                              'lowercase', 'stop',
...                              spelling_tokenfilter
...                              ])
>>>
>>> try:
...   text_analyzer.simulate('blah blah')
... except Exception as e:
...   ex = e
...
>>> ex
RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms'}], 'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms', 'caused_by': {'type': 'parse_exception', 'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}, 'status': 400})

This part is particularly interesting:

'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}

This shows that it managed to find the file, but failed to parse it: the stop filter removes "of" from the entry "course of action", which leaves a position gap that the synonym parser refuses to accept.

Finally, if you remove those two token filters, the error goes away:

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=[
                             #'lowercase', 'stop',
                             spelling_tokenfilter
                             ])
...
>>> text_analyzer.simulate("blah")
{'tokens': [{'token': 'blah', 'start_offset': 0, 'end_offset...}
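
If you want to keep lowercasing and stopword removal for ordinary tokens, one possible workaround (my assumption, not verified against the WordNet file) is to move the synonym filter ahead of 'stop', so that only 'lowercase' is applied when the synonym entries are parsed:

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         # only 'lowercase' now precedes the synonym filter,
                         # so 'stop' is no longer applied to the entries in wn_s.pl
                         filter=['lowercase', spelling_tokenfilter, 'stop'])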

The docs suggest using the multiplexer token filter in case you need to combine these.
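
As a rough sketch of that idea, using the plain settings syntax from the multiplexer docs with the low-level client (the index, filter and analyzer names here are placeholders, and I have not tested this against wn_s.pl):

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index="my_index",  # placeholder index name
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "my_synonyms": {
                        "type": "synonym",
                        "synonyms_path": "analysis/wn_s.pl",
                    },
                    "my_multiplexer": {
                        # each branch below is applied to a copy of every token
                        "type": "multiplexer",
                        "filters": ["lowercase, stop", "lowercase, my_synonyms"],
                    },
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["my_multiplexer"],
                    }
                },
            }
        }
    },
)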

Hope this helps!