Using a regular expression to find all noun phrases in a paragraph following the occurrence of a specific phrase

Question

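I have a data frame of paragraphs, which I (*can) split into word tokens and sentence tokens, and I want to find all noun phrases that follow any occurrence of "contribute to" or "donate to".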

Or really any form of those phrases, so:

"Contributions are welcome to be made to the charity of your choice." 

---> would return: "the charity of your choice"

and

"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"

---> would return: "ABC Foundation"

I created a regex workaround that captures the correct phrase about 90% of the time... see below:

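import nltk
from nltk.text import TokenSearcher

def find_donations(x):  # wrapper implied by the original `return`; the name is illustrative
    text = nltk.Text(nltk.word_tokenize(x))
    donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
    donation = [' '.join(tokens) for tokens in donation]
    return donation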

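I'd like to clean up that regex to get rid of the "{,15}" requirement, because it causes me to miss some values I need. However, I'm not familiar enough with "greedy" expressions to get it working properly.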

So this passage:

While she lived a full life , had many achievements and made many 
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName

returns: "visit brother FirstName Lastname" because of the earlier mention of contributions, even though the word "to" appears more than 15 words later.

Answer 1

(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)

Let me know if this works and meets your needs, and I will explain my regex.
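
As a rough check, the pattern above can be tried with Python's `re` module on the examples from the question. This is only a sketch: the helper name `recipient_after_to` is hypothetical, and `re.IGNORECASE` is an addition (not part of the answer's pattern) so that a capitalized "Contributions" is also matched.

```python
import re

# Pattern from the answer above; re.IGNORECASE is an added assumption.
PATTERN = re.compile(
    r"(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)",
    re.IGNORECASE,
)

def recipient_after_to(text):  # hypothetical helper name
    """Return the captured group, or None if the pattern does not match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

# The passage from the question that produced a false positive.
FALSE_POSITIVE = (
    "While she lived a full life , had many achievements and made many "
    "contributions , FirstName is remembered by most for her cheerful smile , "
    "colorful track suits , and beautiful necklaces hand made by daughter FirstName . "
    "FirstName always cherished her annual visit home for Thanksgiving to visit "
    "brother FirstName LastName"
)

print(recipient_after_to(
    "Contributions are welcome to be made to the charity of your choice."
))  # the charity of your choice
print(recipient_after_to(
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
))  # ABC Foundation
print(recipient_after_to(FALSE_POSITIVE))  # None
```

The lookahead requires a standalone "to" before the next period, which is why the passage with "contributions" in one sentence and "to" in the next yields no match. Note that the trailing `.*` is not limited to one sentence, so text with several sentences after a valid match may still need to be split first.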

Answer 2

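It looks like you are struggling with how to restrict your search criteria to a single sentence. So just use NLTK to split your text into sentences (which is much better than just looking for periods), and your problem goes away.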

import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())
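
To see how the per-sentence search behaves without needing the tokenizer data, here is a minimal sketch that runs the same regex over sentences split by hand, standing in for the output of `nltk.sent_tokenize`. The function name `find_recipients` is hypothetical, and `re.IGNORECASE` is an addition not present in the loop above.

```python
import re

# Same pattern as in the loop above, plus re.IGNORECASE (an added assumption).
DONATION_RE = re.compile(r"\b(contrib|donat).*?\bto\b([^.,;]*)", re.IGNORECASE)

def find_recipients(sents):  # hypothetical helper name
    """Collect the text after the first "to" that follows a keyword in each sentence."""
    recipients = []
    for sent in sents:
        m = DONATION_RE.search(sent)
        if m:
            recipients.append(m.group(2).strip())
    return recipients

# Sentences split by hand here; in practice they would come from nltk.sent_tokenize.
sents = [
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation",
    "While she lived a full life , had many achievements and made many contributions , "
    "FirstName is remembered by most for her cheerful smile , colorful track suits , "
    "and beautiful necklaces hand made by daughter FirstName .",
    "FirstName always cherished her annual visit home for Thanksgiving to visit "
    "brother FirstName LastName",
]
print(find_recipients(sents))  # ['ABC Foundation']
```

Because `.*?` is lazy, the capture starts at the first "to" after the keyword. For "Contributions are welcome to be made to the charity of your choice." this sketch returns "be made to the charity of your choice", so some post-processing may still be needed for sentences with more than one "to".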

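For further work, I'd also recommend using a better tool than Text, which is intended for simple interactive exploration. If you do want to do more with your texts, nltk's PlaintextCorpusReader is your friend.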
