Using a regular expression to find all noun phrases in a paragraph following the occurrence of a specific phrase

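I have a data frame of paragraphs, which I can split into word tokens and sentence tokens, and I want to find all noun phrases that follow any occurrence of "contribute to" or "donate to".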

Or really any form of those phrases, so:

"Contributions are welcome to be made to the charity of your choice." 

---> would return: "the charity of your choice"

"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"

---> would return: "ABC Foundation"

I put together a regex workaround that captures the right phrase about 90% of the time. See below:

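import nltk
from nltk.text import TokenSearcher

def find_donations(x):  # wrapper name is illustrative; `x` is one paragraph string
    text = nltk.Text(nltk.word_tokenize(x))
    donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
    donation = [' '.join(tokens) for tokens in donation]
    return donation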

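I would like to clean up that regex and get rid of the "{,15}" requirement, because it is missing some of the values I need. However, I am not familiar enough with "greedy" expressions to get it working properly.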

So this passage:

While she lived a full life , had many achievements and made many 
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName

returns "visit brother FirstName LastName" because of the earlier mention of contributions, even though the word "to" appears more than 15 words later.

(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)

Let me know if this works and meets your needs, and I will explain my regex.
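For anyone who wants to try the pattern above, here is a minimal sketch that runs it on the two example sentences from the question. The `re.IGNORECASE` flag and the helper name `find_recipient` are my assumptions, not part of the original pattern; without the flag, a sentence-initial "Contributions" would not match. Roughly: the first group lists the trigger stems, the lookahead requires a standalone "to" before the next period, the greedy `.*` then runs forward to the last "to" followed by whitespace, and the capture group takes everything up to the next period.

```python
import re

# Pattern copied from the answer above. IGNORECASE is an added assumption
# so that a capitalized "Contributions" is also treated as a trigger.
PATTERN = re.compile(
    r"(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)",
    re.IGNORECASE,
)


def find_recipient(sentence):
    """Return the captured text after the last 'to ' on the line, or None."""
    m = PATTERN.search(sentence)
    return m.group(1).strip() if m else None


examples = [
    "Contributions are welcome to be made to the charity of your choice.",
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation",
]

for s in examples:
    print(find_recipient(s))
# the charity of your choice
# ABC Foundation
```

Note that only the lookahead is confined to one sentence. The greedy `.*` can still run past a period, so on multi-sentence input the capture may come from a later sentence.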

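It looks like you are struggling with how to restrict your search to a single sentence. So just use NLTK to split your text into sentences (it does a much better job than looking for periods), and your problem goes away.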

import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    # Trigger stem, then the first standalone "to", then text up to . , or ;
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())
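To see why per-sentence matching removes the false positive from the question, here is a small sketch that needs only the standard library. The hand-written `sentences` list stands in for the output of `nltk.sent_tokenize` (an assumption, so the example runs without NLTK data), the second sentence is shortened from the question's passage, and `re.IGNORECASE` and the function name `find_recipients` are my additions.

```python
import re

# Same pattern as above, with IGNORECASE added (an assumption) so that
# "Contributions" at the start of a sentence also counts as a trigger.
RECIPIENT = re.compile(r"\b(contrib|donat).*?\bto\b([^.,;]*)", re.IGNORECASE)


def find_recipients(sentences):
    """Apply the pattern to each sentence separately and collect the captures."""
    recipients = []
    for sent in sentences:
        m = RECIPIENT.search(sent)
        if m:
            recipients.append(m.group(2).strip())
    return recipients


# Stand-in for nltk.sent_tokenize(x): one string per sentence.
sentences = [
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation",
    "While she lived a full life, had many achievements and made many contributions, "
    "FirstName is remembered by most for her cheerful smile.",
    "FirstName always cherished her annual visit home for Thanksgiving to visit brother FirstName LastName",
]

print(find_recipients(sentences))  # ['ABC Foundation']
```

The "contributions" sentence has no standalone "to", and the "to visit brother" sentence has no trigger word, so neither matches. One caveat: the lazy `.*?` stops at the first "to" after the trigger, so "Contributions are welcome to be made to the charity of your choice." yields "be made to the charity of your choice" rather than just the noun phrase.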

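For further work, I would also recommend a better tool than Text, which is meant for simple interactive exploration. If you do want to do more with your texts, nltk's PlaintextCorpusReader is your friend.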