Searching for a word/phrase in a string, including all possible approximations of the phrase
Suppose I have the following string:
string = 'machine learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'
Further suppose I have a tag defined as:
tag = 'machine learning'
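Now I want to find the tag in my string. As you can see, the string contains machine learning in three places: once at the beginning of the string, once as machine12 learning, and once as machines learning. I would like to find all of them and produce the output list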
['machine learning', 'machine12 learning', 'machines learning']
To do this, I tried tokenizing my tag with nltk, i.e.
tag_token = nltk.word_tokenize(tag)
which gives me ['machine', 'learning']. I would then search for tag_token[0].
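I know that string.find(tag_token[0]) and string.rfind(tag_token[0]) give the positions of the first and last occurrence of machine, but what if the text contains more occurrences of machine learning (here there are 3)? In that case I can't extract all of them, so my initial idea of finding every occurrence of machine and then learning would fail. I was hoping to use fuzzywuzzy to then compare ['machine learning', 'machine12 learning', 'machines learning'] against the tag.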
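For the find/rfind limitation described above, here is a minimal sketch of how every occurrence of a token could be located with re.finditer. The shortened text and the variable names (text, first, last, starts) are my own illustration, not part of the original code.

```python
import re

# Shortened stand-in for the string in the question (assumption for illustration).
text = ("machine learning ml is a type of artificial intelligence "
        "machine12 learning algorithms use historical data "
        "machines learning is good")

# str.find / str.rfind report only the first and the last position.
first = text.find("machine")
last = text.rfind("machine")

# re.finditer yields one match object per occurrence, in left-to-right order.
starts = [m.start() for m in re.finditer(re.escape("machine"), text)]

print(first, last)
print(starts)
```

This only locates the first token; it does not yet check that learning follows, which is what the regex-based approaches in this thread add.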
So my question is: given the string I have, how can I search for the tag and its approximations and list them as follows?
['machine learning', 'machine12 learning', 'machines learning']
Update: I now know that I can do the following:
pattern = re.compile(r"(machine[\s0-9]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
# output: ['machine learning', 'machine12 learning']
and if I also do
pattern = re.compile(r"(machine[\sA-Za-z]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
# output: ['machine learning', 'machines learning']
But this is certainly not a general solution. So I wonder whether there is a smarter way to search in this case?
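One reason the two hand-written patterns do not generalize: simply merging the character classes makes the greedy quantifier swallow almost the whole text. The sketch below (shortened text and variable names are my own assumptions) contrasts the greedy and the lazy form of the merged class.

```python
import re

# Shortened stand-in for the string in the question (assumption for illustration).
text = ("machine learning ml is a type of artificial intelligence "
        "machine12 learning algorithms use historical data "
        "machines learning is good")

# Greedy: the class covers letters, digits and whitespace, so + runs to the
# end of the text and backtracks only as far as the LAST 'learning'.
greedy = re.findall(r"machine[\sA-Za-z0-9]+learning", text, re.IGNORECASE)

# Lazy: +? stops at the first 'learning' that follows each 'machine'.
lazy = re.findall(r"machine[\sA-Za-z0-9]+?learning", text, re.IGNORECASE)

print(len(greedy))
print(lazy)
```

Even the lazy form will bridge an arbitrarily long gap when learning only appears much later in the text, so bounding the gap between the tokens is probably still needed.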
Maybe use a pattern like (string\w*) for each token in the tag?
import re

string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'
tag_token = ['machine', 'learning']
# Each token may carry a suffix (\w*); between tokens allow whitespace
# plus at most one extra word.
gap = r'\s+(?:\S*\s+)?'
pattern = '(' + gap.join(re.escape(e) + r'\w*' for e in tag_token) + ')'
rgx = re.compile(pattern, re.IGNORECASE)
print(rgx.findall(string))
# output
# ['machine 12 learning', 'machine12 learning', 'machines learning']
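If this is needed for several tags, the same idea could be wrapped in a small function. This is only a sketch: find_tag_variants and its max_extra_words parameter are hypothetical names of mine, and the tag is split on whitespace with str.split() rather than nltk.word_tokenize.

```python
import re

def find_tag_variants(text, tag, max_extra_words=1):
    """Hypothetical helper: find `tag` with suffixed tokens and a bounded gap."""
    tokens = tag.split()  # plain whitespace split, not nltk.word_tokenize
    # Between two tokens: whitespace, then up to `max_extra_words` other words.
    gap = r"\s+(?:\S+\s+){0,%d}" % max_extra_words
    pattern = gap.join(re.escape(tok) + r"\w*" for tok in tokens)
    return re.findall(pattern, text, flags=re.IGNORECASE)

# Shortened stand-in for the string used above (assumption for illustration).
text = ("machine 12 learning ml is a type of artificial intelligence "
        "machine12 learning algorithms use historical data "
        "machines learning is good")

variants = find_tag_variants(text, "machine learning")
print(variants)
```

Raising max_extra_words tolerates longer insertions between the tokens, at the cost of more false positives.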
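It gets harder to find matches when the order of the words in the tag can change.
With itertools.permutations, the pattern can cover every ordering of the tokens in tag_token, e.g.
machine s learning
machine learning
machine12 12 learning
learning machine
You can also use a different string and a tag_token with more than 2 words; every ordering of those words will be found.
For example, tag_token = ['1', '2', '3'] will match
1 2 3
1a 2 b 3
2b2 1sss 3
333 2tt 1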
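The 1/2/3 example above can be checked with a compact sketch of the permutation idea. build_any_order_pattern is a hypothetical helper name of mine; the gap expression is the one used in this answer.

```python
import itertools
import re

def build_any_order_pattern(tokens):
    """Hypothetical helper: one regex alternative per ordering of `tokens`."""
    gap = r"\s+(?:\S*\s+)?"  # whitespace plus at most one extra word
    alternatives = (
        gap.join(re.escape(tok) + r"\w*" for tok in order)
        for order in itertools.permutations(tokens)
    )
    return re.compile("(" + "|".join(alternatives) + ")", re.IGNORECASE)

rgx = build_any_order_pattern(["1", "2", "3"])
samples = ["1 2 3", "1a 2 b 3", "2b2 1sss 3", "333 2tt 1"]
for sample in samples:
    print(sample, "->", rgx.findall(sample))
```

Note that the number of alternatives grows factorially with the number of tokens (6 for three tokens, 24 for four), so this is probably only practical for short tags.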
import re
import itertools

string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good. Learning machine can be used to train people. learning the machines is a great job'
tag_token = ['machine', 'learning']
gap = r'\s+(?:\S*\s+)?'
# Build one alternative per ordering of the tokens and join them with '|'.
alternatives = []
for current_tag in itertools.permutations(tag_token, len(tag_token)):
    alternatives.append(gap.join(re.escape(e) + r'\w*' for e in current_tag))
pattern = '(' + '|'.join(alternatives) + ')'
rgx = re.compile(pattern, re.IGNORECASE)
print(rgx.findall(string))
# output
# ['machine 12 learning', 'machine12 learning', 'machines learning', 'Learning machine', 'learning the machines']
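Since the question mentions scoring the matches against the tag with fuzzywuzzy, here is a rough sketch of that last step. To stay dependency-free it uses difflib.SequenceMatcher from the standard library instead of fuzzywuzzy, so the scores will not be identical to what fuzzywuzzy reports; rank_by_similarity is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def rank_by_similarity(tag, candidates):
    """Hypothetical helper: score each candidate against `tag`, best first."""
    scored = [
        (SequenceMatcher(None, tag.lower(), cand.lower()).ratio(), cand)
        for cand in candidates
    ]
    return sorted(scored, reverse=True)

# Candidates as returned by the regex above.
candidates = ['machine 12 learning', 'machine12 learning', 'machines learning',
              'Learning machine', 'learning the machines']

ranked = rank_by_similarity('machine learning', candidates)
for score, cand in ranked:
    print(f"{score:.2f}  {cand}")
```

Because the comparison is character-based, candidates with the tokens in reversed order score noticeably lower than suffixed variants, which may or may not be what you want; a threshold on the score could then filter the list.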