Use Python to extract three sentences based on word finding

I am working on a text mining use case in Python. These are the sentences of interest:

As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased. Stores are primarily located in shopping malls and other shopping centers.

How can I extract the sentences containing the keyword "China"? I also need the two sentences before and after each match, in fact at least two sentences before and after.

I have tried the approach below, from an answer here:

import nltk
from nltk.tokenize import word_tokenize

# text holds the passage above; this finds only the matching sentence itself.
sents = nltk.sent_tokenize(text)
my_sentences = [sent for sent in sents if 'China' in word_tokenize(sent)]

Please help!

TL;DR

Use sent_tokenize, keep track of the indices of the sentences that contain the focus word, then take a window of sentences around each of those indices to get the desired result.

from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Rejoins a list of tokens back into a readable sentence string.
word_detokenize = TreebankWordDetokenizer().detokenize

text = """As a result may continue to be adversely impacted, by fluctuations in foreign currency exchange rates. Certain events such as the threat of additional tariffs on imported consumer goods from China, have increased global economic and political uncertainty and caused volatility in foreign currency exchange rates. Stores are primarily located in shopping malls and other shopping centers, certain of which have been experiencing declines in customer traffic."""

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text) 
                       if 'China' in sent or 'china' in sent]

window = 2 # Number of sentences of context before and after.

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window + 1, len(tokenized_text))  # +1 because the slice end is exclusive
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
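
If you need this in more than one place, the same logic packages naturally into a generator. Here is a minimal sketch assuming the same NLTK setup as above; the name sentences_around_keyword and the case-insensitive token comparison are my own additions, not part of the original answer:

from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

def sentences_around_keyword(text, keyword, window=2):
    # Yield each keyword sentence together with `window` sentences
    # of context on each side.
    detokenize = TreebankWordDetokenizer().detokenize
    tokenized = [word_tokenize(sent) for sent in sent_tokenize(text)]
    target = keyword.lower()
    for idx, sent in enumerate(tokenized):
        # Compare tokens case-insensitively so 'China' and 'china' both match.
        if any(token.lower() == target for token in sent):
            start = max(idx - window, 0)
            end = min(idx + window + 1, len(tokenized))  # slice end is exclusive
            yield ' '.join(detokenize(s) for s in tokenized[start:end])

for passage in sentences_around_keyword(text, 'China', window=2):
    print(passage)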

Here is another example; run pip install wikipedia first:

from nltk import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Rejoins a list of tokens back into a readable sentence string.
word_detokenize = TreebankWordDetokenizer().detokenize

import wikipedia

text = wikipedia.page("Winnie The Pooh").content

tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]

sent_idx_with_china = [idx for idx, sent in enumerate(tokenized_text) 
                       if 'China' in sent or 'china' in sent]

window = 2 # Number of sentences of context before and after.

for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window + 1, len(tokenized_text))  # +1 because the slice end is exclusive
    result = ' '.join(word_detokenize(sent) for sent in tokenized_text[start:end])
    print(result)
    print()

[out]:

Ashdown Forest in England where the Pooh stories are set is a popular tourist attraction, and includes the wooden Pooh Bridge where Pooh and Piglet invented Poohsticks. The Oxford University Winnie the Pooh Society was founded by undergraduates in 1982. == Censorship in China == In the People's Republic of China, images of Pooh were censored in mid-2017 from social media websites, when internet memes comparing Chinese president Xi Jinping to Pooh became popular. The 2018 film Christopher Robin was also denied a Chinese release.
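
One caveat: if two keyword sentences fall within window sentences of each other, their context windows overlap and the loop above prints the shared sentences more than once. Here is a minimal sketch that merges the overlapping index ranges before printing; this deduplication step is my own addition, not part of the original answer:

# Build a (start, end) range around each hit, merging ranges that overlap.
ranges = []
for idx in sent_idx_with_china:
    start = max(idx - window, 0)
    end = min(idx + window + 1, len(tokenized_text))
    if ranges and start <= ranges[-1][1]:
        # Overlaps (or touches) the previous range: extend it instead.
        ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
    else:
        ranges.append((start, end))

for start, end in ranges:
    print(' '.join(word_detokenize(sent) for sent in tokenized_text[start:end]))
    print()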