Searching for a word/phrase in a string with all the possible approximations of the phrase

Question

Suppose I have the following string:

string = 'machine learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'

Suppose further that I have a tag defined as:

tag = 'machine learning'

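Now I want to find the tag in my string. As you can see from my string, machine learning appears in three places: once at the beginning of the string, once as machine12 learning, and finally as machines learning. I would like to find all of them and produce the output list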

['machine learning', 'machine12 learning', 'machines learning']

To do this, I tried tokenizing my tag with nltk, i.e.

tag_token = nltk.word_tokenize(tag)

That gives me ['machine', 'learning']. Then I would search for tag_token[0].

I know that string.find(tag_token[0]) and string.rfind(tag_token[0]) give the positions of the first and the last occurrence of machine, but what if the text contains more occurrences of machine learning (here there are 3)?

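In that case I would not be able to extract all of them. So my initial idea was to find all occurrences of machine and then of learning, which would fail. I was hoping to then use fuzzywuzzy to analyze ['machine learning', 'machine12 learning', 'machines learning'] with respect to the tag.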

So my question is: given the string I have, how can I search for the tag and its approximations and list them as follows?

['machine learning', 'machine12 learning', 'machines learning']

Update: I now know that I can do the following:

import re

pattern = re.compile(r"(machine[\s0-9]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
# output: ['machine learning', 'machine12 learning']

And if I also do

pattern = re.compile(r"(machine[\sA-Za-z]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
# output: ['machine learning', 'machines learning']

But this is certainly not a general solution. So I wonder whether there is a smart way to search in this case?

Answer 1

Maybe use a pattern like (string\w*)?

import re

string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'

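tag_token = ['machine', 'learning']

# Each tag word may carry a suffix (\w*); between two tag words allow
# whitespace plus at most one extra word. Raw strings keep the backslashes intact.
gap = r'\s+(?:\S*\s+)?'
pattern = '(' + gap.join(re.escape(e) + r'\w*' for e in tag_token) + ')'

rgx = re.compile(pattern, re.IGNORECASE)
rgx.findall(string)
# output
# ['machine 12 learning', 'machine12 learning', 'machines learning']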
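The question mentions scoring the candidates against the tag with fuzzywuzzy. As a rough sketch of that idea using only the standard library, difflib.SequenceMatcher can stand in for fuzzywuzzy here (that substitution is an assumption, and the two libraries do not produce identical scores). The helper name similarity is made up for illustration.

```python
from difflib import SequenceMatcher

tag = 'machine learning'
# Candidates as returned by the regex search above.
candidates = ['machine 12 learning', 'machine12 learning', 'machines learning']


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Highest-scoring candidate first.
scored = sorted(((similarity(tag, c), c) for c in candidates), reverse=True)
for score, candidate in scored:
    print(f'{score:.2f}  {candidate}')
```

A threshold on the score could then be used to drop candidates that drift too far from the tag; where to put that threshold depends on the data.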

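If the order of the words in the tag can change, finding matches becomes harder.

The code below finds every ordering of the words in tag_token, e.g. machine s learning, machine learning, machine12 12 learning, learning machine, ... You can also use a different string and a tag_token with more than 2 words; every ordering of those words will be found.

For example, tag_token = ['1', '2', '3'] will match 1 2 3 and 1a 2 b 3 as well as 2b2 1sss 3 and 333 2tt 1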

import re
import itertools

string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good. Learning machine can be used to train people. learning the machines is a great job'

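tag_token = ['machine', 'learning']

# Build one alternative per ordering of the tag words. Within an ordering,
# each word may carry a suffix (\w*) and at most one extra word may sit
# between two neighbouring tag words.
gap = r'\s+(?:\S*\s+)?'
alternatives = (gap.join(re.escape(e) + r'\w*' for e in current_tag)
                for current_tag in itertools.permutations(tag_token))

pattern = '(' + '|'.join(alternatives) + ')'
rgx = re.compile(pattern, re.IGNORECASE)

rgx.findall(string)

# output
# ['machine 12 learning', 'machine12 learning', 'machines learning', 'Learning machine', 'learning the machines']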
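To try the three-word example mentioned above, the same pattern construction can be wrapped in a small function. This is only a sketch: the name find_tag_variants and the sample strings are chosen here for illustration, and re.escape is added on the assumption that tag words should be matched literally.

```python
import itertools
import re


def find_tag_variants(text, tag_token):
    """Find the tag words in any order.

    Each word may carry a suffix, and at most one extra word may sit
    between two neighbouring tag words.
    """
    gap = r'\s+(?:\S*\s+)?'
    alternatives = (
        gap.join(re.escape(word) + r'\w*' for word in ordering)
        for ordering in itertools.permutations(tag_token)
    )
    rgx = re.compile('(' + '|'.join(alternatives) + ')', re.IGNORECASE)
    return rgx.findall(text)


samples = ['1 2 3', '1a 2 b 3', '2b2 1sss 3', '333 2tt 1']
for sample in samples:
    print(sample, '->', find_tag_variants(sample, ['1', '2', '3']))
```

Note that the number of alternatives grows factorially with the number of tag words, so this approach is best kept to short tags.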

Tags: python, regex, full-text-search, fuzzy-search, fuzzywuzzy