Python fuzzy element checking

Question

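I have a Python 2.7 set object containing the names of data categories, and I would like to do some form of fuzzy element check to see whether part of a user's input matches an element of the set.

Here is a toy example to explain what I want. Given the following set and user input: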

SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'

I would like the program to print something like this:

'yellow_ball' not found, did you mean 'red_ball', or 'green_ball'?

This is what I have so far:

import re

SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'

# all members of my set are lowercase and separated by an underscore
user_input_list = user_input.lower().split() # for use in fuzzy search
user_input = "_".join(user_input_list) # convert to yellow_ball for element check
regex = None
matches = []

if user_input not in SET:
    # FUZZY ELEMENT CHECK
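    for item in user_input_list:
        # escape the word and use search() so it can match anywhere in the element
        # (match() only matches at the start, so 'ball' would never hit 'red_ball')
        regex = re.compile(re.escape(item))
        for element in SET:
            if regex.search(element):
                matches.append(element)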

    if len(matches) > 0:
        print '\'%s\' not found, did you mean %s' % (user_input, ", ".join(['\'' + x + '\'' for x in matches]))
    else:
        print '\'%s\' not found.' % user_input

Is there a more efficient way to do this, perhaps using a third-party library?

Thanks for your help, Geraint

Answer 1

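If you are open to third-party modules, I like to use a small module called fuzzywuzzy for fuzzy string matching in Python.

It lets you do fuzzy string matching in just a few lines of code.

Here is an example of how to use it: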

>>> from fuzzywuzzy import process
>>> choices = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
>>> query = 'yellow ball'

We have set up our choices and our input, so now we can retrieve the closest matches.

>>> process.extract(query, choices)
[('red_ball', 53), ('green_ball', 48), ('red_cup', 13), ('green_cup', 10)]

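This returns all of the choices, each paired with a score, in descending order of how closely they match. The distance between strings is calculated using the Levenshtein distance metric. If the original input is not in the set of choices, you can take the top n items and offer them as valid alternatives.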
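To make the Levenshtein idea concrete, here is a minimal pure-Python sketch of the distance itself plus a "top n" helper. This is not fuzzywuzzy's code or its scoring (fuzzywuzzy reports a similarity score, not a raw edit count); the names levenshtein and closest are made up for illustration, and I am assuming the input has already been normalised to 'yellow_ball'.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def closest(query, choices, n=2):
    """Return the n choices with the smallest edit distance to query.

    Hypothetical helper: ties are broken alphabetically so the result is stable.
    """
    return sorted(choices, key=lambda c: (levenshtein(query, c), c))[:n]


choices = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
top_two = closest('yellow_ball', choices, n=2)  # ['red_ball', 'green_ball']
```

A smaller distance means a closer match, which is why this sorts ascending, whereas process.extract sorts its scores descending.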

If you only want the top match, just do this:

>>> process.extractOne(query, choices)
('red_ball', 53)
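If installing a third-party package is not an option, a similar "best matches" lookup can be approximated with the standard library's difflib. To be clear, this is a different technique from fuzzywuzzy's scoring (difflib uses its own sequence-matching ratio), and the function name did_you_mean and the cutoff of 0.5 are my own assumptions, chosen only so that both '*_ball' names qualify in the toy example.

```python
import difflib


def did_you_mean(user_input, names, n=2, cutoff=0.5):
    """Build a 'not found, did you mean ...' message, or None on an exact hit.

    Hypothetical helper: names are assumed to be lowercase and joined by
    underscores, as in the question.
    """
    key = "_".join(user_input.lower().split())
    if key in names:
        return None
    # get_close_matches returns at most n names scoring >= cutoff, best first;
    # sorting the names first keeps the result independent of set ordering
    matches = difflib.get_close_matches(key, sorted(names), n=n, cutoff=cutoff)
    if matches:
        quoted = ", or ".join("'%s'" % m for m in matches)
        return "'%s' not found, did you mean %s?" % (key, quoted)
    return "'%s' not found." % key


names = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
message = did_you_mean('yellow ball', names)
```

With the default cutoff of 0.6 only 'red_ball' would be suggested here, so the cutoff is worth tuning against your real category names.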

You can read more examples of using fuzzy string matching here.

Answer 2

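I rewrote your program and removed the regular expressions. I wasn't sure whether you wanted an underscore or a space as the word separator (this is easy to change).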

SET = ( 'red ball', 'green ball', 'red cup', 'green cup')

# For each element in the set, build a list of words
WORDS = {}
for s in SET:
  WORDS[s] = list( s.split(' ') )

# get user input
user_input = 'yellow ball'

if user_input not in SET:
  # determine possible answers
  input_words = user_input.split(' ')
  other_answers = []
  for s in WORDS:
    if any(w in WORDS[s] for w in input_words):
      other_answers.append(s)
  # print result
  if len(other_answers) > 0:
    print "'%s' not found, did you mean %s" % (
      user_input, 
      ", or ".join(["'%s'" % oa for oa in other_answers])
    )
  else:
    print "'%s' not found" % user_input
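The same word-overlap idea can be wrapped in a function with the separator as a parameter, which covers the underscore-or-space question. This is only a sketch: the names suggest_by_words and build_message are invented here, and it returns the message instead of printing it so that it runs unchanged on Python 2.7 and Python 3.

```python
def suggest_by_words(user_input, names, sep='_'):
    """Return the names that share at least one whole word with user_input."""
    input_words = set(user_input.lower().split())
    # a name qualifies if any of its sep-separated words was typed by the user
    return sorted(name for name in names
                  if input_words & set(name.split(sep)))


def build_message(user_input, names, sep='_'):
    """Return None on an exact hit, otherwise a 'not found' message."""
    key = sep.join(user_input.lower().split())
    if key in names:
        return None
    suggestions = suggest_by_words(user_input, names, sep)
    if suggestions:
        quoted = ", or ".join("'%s'" % s for s in suggestions)
        return "'%s' not found, did you mean %s?" % (key, quoted)
    return "'%s' not found." % key


names = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
message = build_message('yellow ball', names)
```

Using set intersection on whole words means 'ball' matches 'red_ball' but a fragment such as 'bal' does not, which is stricter than the substring match in the question.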
