How to return more than one match on a list of text?
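I currently have a function that yields a term and the sentence it occurs in. At the moment, the function only retrieves the first match from the list of terms. I would like to retrieve all matches, not just the first one.
For example, list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
and the sentences are text_list = ["A heart attack is a result of cardiovascular...", "Chronic intermittent hypoxia is the..."]
The ideal output is: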
['heart attack', 'a heart attack is a result of cardiovascular...'],
['cardiovascular', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']
# this is the current function
def find_word(list_of_matches, line):
    for words in list_of_matches:
        if any([words in line]):
            return words, line

# returns list of 'term, matched string'
key_vals = [list(find_word(list_of_matches, line.lower())) for line in text_list if
            find_word(list_of_matches, line.lower()) != None]
# output is currently
['heart attack', 'a heart attack is a result of cardiovascular...'],
['hypoxia', 'chronic intermittent hypoxia is the...']
You'll want to use regular expressions here.
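import re

def find_all_matches(words_to_search, text):
    matches = []
    for word in words_to_search:
        # re.escape treats the term as literal text rather than as a pattern
        match = re.search(re.escape(word), text)
        # re.search returns None when the term is absent, so check before .group()
        if match:
            matches.append(match.group())
    return [matches, text]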
Note that this returns a nested list containing all the matches.
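If the flat [term, sentence] pairs from the question are wanted directly, the same regex idea could be sketched as below. The helper name find_pairs and the use of \b word boundaries are assumptions for illustration, not part of the answer above; the boundaries simply stop a term from matching inside a longer word.

```python
import re

def find_pairs(words_to_search, text_list):
    # Hypothetical helper: one [term, sentence] pair per term found in a sentence.
    pairs = []
    for text in text_list:
        lowered = text.lower()
        for word in words_to_search:
            # re.escape keeps the term literal; \b restricts matches to whole words.
            pattern = r"\b" + re.escape(word) + r"\b"
            if re.search(pattern, lowered):
                pairs.append([word, lowered])
    return pairs

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the..."]
pairs = find_pairs(list_of_matches, text_list)
```

Here pairs holds three entries: two for the first sentence and one for the second.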
The solution takes 2 steps:
- Fix the function
- Process the output
Given that your desired output follows the pattern
output = [
    [word1, sentence1],
    [word2, sentence1],
    [word3, sentence2],
]
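- Fix the function:
You should change the return inside the 'for' loop so that the loop iterates over every word in list_of_matches and collects all matching words, not just the first one. It should look like this: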
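def find_word(list_of_matches, line):
    answer = []
    for words in list_of_matches:
        # a plain membership test; any([...]) around a single value is redundant
        if words in line:
            answer.append([words, line])
    # return only after every term has been checked
    return answer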
With the function above, the output will be:
key_vals = [
    [
        ['heart attack', 'a heart attack is a result of cardiovascular...'],
        ['cardiovascular', 'a heart attack is a result of cardiovascular...']
    ],
    [
        ['hypoxia', 'chronic intermittent hypoxia is the...']
    ]
]
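For reference, here is one way key_vals might be built with the fixed function. This is a sketch, not the answerer's exact code. The fixed find_word returns an empty list rather than None when nothing matches, so the `!= None` filter from the question would no longer drop unmatched sentences; the truthiness check below is an assumption about how to skip them.

```python
def find_word(list_of_matches, line):
    # Collect every matching term instead of returning at the first one.
    answer = []
    for words in list_of_matches:
        if words in line:
            answer.append([words, line])
    return answer

list_of_matches = ["heart attack", "cardiovascular", "hypoxia"]
text_list = ["A heart attack is a result of cardiovascular...",
             "Chronic intermittent hypoxia is the...",
             "A sentence without any of the terms."]

# Call find_word once per sentence and keep only non-empty results.
per_sentence = (find_word(list_of_matches, line.lower()) for line in text_list)
key_vals = [pairs for pairs in per_sentence if pairs]
```

The third sentence is included only to show that sentences without a match are skipped.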
- Process the output:
Now you need to take the variable "key_vals" and flatten the per-sentence lists of lists with the following code:
output = []
for word_sentence_list in key_vals:
    for word_sentence in word_sentence_list:
        output.append(word_sentence)
Finally, the output will be:
output = [
    ['heart attack', 'a heart attack is a result of cardiovascular...'],
    ['cardiovascular', 'a heart attack is a result of cardiovascular...'],
    ['hypoxia', 'chronic intermittent hypoxia is the...']
]
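As an alternative to the nested loop, the flattening step could also be written with itertools.chain.from_iterable from the standard library. This is a swapped-in technique, not what the answer uses, and key_vals is written out literally here only so the snippet runs on its own.

```python
from itertools import chain

key_vals = [
    [
        ["heart attack", "a heart attack is a result of cardiovascular..."],
        ["cardiovascular", "a heart attack is a result of cardiovascular..."],
    ],
    [
        ["hypoxia", "chronic intermittent hypoxia is the..."],
    ],
]

# chain.from_iterable walks each inner list in turn, removing one level of nesting.
output = list(chain.from_iterable(key_vals))
```

Both versions produce the same flat list of [term, sentence] pairs; the loop is easier to extend, while the chain call is shorter.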