How to extract the current sentence and the surrounding sentences around a particular word with Python?
Is there a way to get the sentences surrounding any chosen word in a text? Say our goal is to get the sentence containing the word "champion" in the example below, plus the sentences immediately before and after it, regardless of their position, part-of-speech tags, or how many times the word champion is repeated.

text = "This is sentence 1. We are the champions. This is sentence 3. This is sentence 4. This is sentence 5. You are champions too."

In the example above, the word champion occurs in sentences 2 and 6, so we want sentences 1, 2, 3, 5, and 6 returned, and sentence 4 excluded.

How can we achieve this with spaCy or another tool?
Using this function will give the surrounding sentences:
from nltk.tokenize import sent_tokenize, word_tokenize

def surrounding_sentences(text, word):
    sentences = sent_tokenize(text)
    my_sents = []
    for i in range(len(sentences)):
        if word in word_tokenize(sentences[i].lower()):
            if i - 1 >= 0:  # was `i - 1 > 0`, which wrongly skipped the sentence before index 1
                my_sents.append(sentences[i - 1])
            my_sents.append(sentences[i])
            if i + 1 < len(sentences):
                my_sents.append(sentences[i + 1])
    # dict.fromkeys deduplicates while preserving sentence order;
    # list(set(...)) would scramble the order
    my_sents = list(dict.fromkeys(my_sents))
    return my_sents
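The same previous/current/next logic can also be written without NLTK, as a pure index-window pass over an already-split sentence list (a minimal sketch; `window_sentences` is a hypothetical helper name, and the sentence splitting is assumed to have been done elsewhere):

```python
def window_sentences(sentences, word):
    """Return sentences containing `word` plus their immediate neighbors, in order."""
    # indices of sentences whose whitespace-split tokens contain the word
    hits = {i for i, s in enumerate(sentences) if word in s.casefold().split()}
    keep = set()
    for i in hits:
        keep.update({i - 1, i, i + 1})
    # iterate in order so the result keeps the original sentence order
    return [s for i, s in enumerate(sentences) if i in keep]

sents = ["This is sentence 1", "We are the champions", "This is sentence 3",
         "This is sentence 4", "This is sentence 5", "You are champions too"]
print(window_sentences(sents, "champions"))
# ['This is sentence 1', 'We are the champions', 'This is sentence 3',
#  'This is sentence 5', 'You are champions too']
```

Building a set of "keep" indices up front also avoids the duplicate-append problem entirely, so no deduplication step is needed afterwards.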
You can just use re.split
to split on the punctuation, determine which sentences contain the word, and then grab anything whose index matches or is adjacent to those sentences.
>>> import re
>>> text = "This is sentence 1. We are the champions. This is sentence 3. This is sentence 4. This is sentence 5. You are champions too."
>>> sentences = re.split(r'[.!?] *', text)[:-1]
>>> sentences
['This is sentence 1', 'We are the champions', 'This is sentence 3', 'This is sentence 4', 'This is sentence 5', 'You are champions too']
>>> champion_indices = set(
...     [
...         i for i in range(len(sentences))
...         if 'champions' in sentences[i].casefold()]
...     )
>>> champion_indices
{1, 5}
>>> champion_adjacent_sentences = [
...     sentences[i] for i in range(len(sentences))
...     if (i - 1 in champion_indices
...         or i in champion_indices
...         or i + 1 in champion_indices)]
>>> champion_adjacent_sentences
['This is sentence 1', 'We are the champions', 'This is sentence 3', 'This is sentence 5', 'You are champions too']
The only thing here that may be unfamiliar is the use of casefold,
which is a more robust way of lowercasing two strings for a case-insensitive comparison.
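Since the question asks about spaCy: here is a minimal sketch using spaCy's rule-based sentencizer component, which splits sentences without needing a trained model download (assumes only that the spacy package is installed; `surrounding_sentences_spacy` is a hypothetical name):

```python
import spacy

def surrounding_sentences_spacy(text, word):
    # blank pipeline + sentencizer: rule-based sentence boundaries, no model needed
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    sents = [s.text for s in nlp(text).sents]
    # indices of sentences containing the word (case-insensitive token match)
    hits = {i for i, s in enumerate(sents)
            if word.lower() in [t.lower().strip(".,!?") for t in s.split()]}
    # keep any sentence whose index is a hit or adjacent to one, in order
    return [s for i, s in enumerate(sents) if hits & {i - 1, i, i + 1}]

text = ("This is sentence 1. We are the champions. This is sentence 3. "
        "This is sentence 4. This is sentence 5. You are champions too.")
print(surrounding_sentences_spacy(text, "champions"))
```

For higher-quality sentence boundaries (abbreviations, quotes), you could swap the blank pipeline for a trained one such as `en_core_web_sm` and the same logic still applies.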