Text Analysis: extracting person name and quotation: how to create a pattern

Question

I need to extract the person name and the quotation from a given text, such as:

Homer Simpson said: "Okay, here we go..."

Return values:

extracted_person_name - The extracted person name, as appearing in the patterns explained above

extracted_quotation - The extracted quoted text (without the surrounding quotation marks).

Important Note: if the pattern is not found, return None values for both the
extracted person name and the extracted text.

You can expect the input text to look similar to the following pattern:

Person name said: "quoted text"

Variations of the above pattern:

The colon punctuation mark (:) is optional and might not appear in the input sentence. Instead of the word said, you could also expect the words:

answered, responded, replied

This is what I have so far:

def person_quotation_pattern_extraction(raw_text):
    
    name_pattern = "\w+\s\w+"
    quote_pattern = "["]\w+["]"
    
    quote = re.search(quote_pattern, raw_text)
    name = re.search(name_pattern, raw_text)

    
    if re.search(quote_pattern,raw_text):
        extracted_quotation = quote.group(0) 
    else:
        extracted_quotation=None

    if re.search(name_pattern,raw_text):
        extracted_person_name = name.group(0)
    else:
        extracted_person_name=None

    return extracted_person_name, extracted_quotation

The problem is that it returns Null. I assume the patterns are incorrect; can you tell me what is wrong with them?

Answer 1

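The first pattern is fine. It would match "Homer Simpson" as well as "here we", but since re.search returns only the first match and you return group 0 of it, that is fine.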
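To see this for yourself, here is a small sketch using the example sentence from the question. The variable names first and all_matches are just for illustration:

```python
import re

raw_text = 'Homer Simpson said: "Okay, here we go..."'
name_pattern = r"\w+\s\w+"

# re.search stops at the first (leftmost) match.
first = re.search(name_pattern, raw_text).group(0)

# re.findall lists every non-overlapping match, left to right.
all_matches = re.findall(name_pattern, raw_text)

print(first)        # Homer Simpson
print(all_matches)  # ['Homer Simpson', 'here we']
```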

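The second pattern has some problems. Since you open the string with " and use the same " inside the string, Python thinks the string ends there. You can observe this in the color of the characters, which changes from green (string) to black (not a string) and back to green.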

quote_pattern = "["]\w+["]"

You can prevent this by starting (and ending) your string with single quotes ', like this:

quote_pattern ='["]\w+["]'
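As a side note, here is a sketch of two equivalent ways to write a pattern that contains double quotes. The r prefix (a raw string) is my addition, not part of your code; it keeps Python from treating \w as a string escape, which recent Python versions warn about in ordinary strings. The names single_quoted and escaped are only for illustration:

```python
import re

# Raw, single-quoted: the double quotes inside need no escaping.
single_quoted = r'["]\w+["]'

# Double-quoted: every inner " and the backslash must be escaped.
escaped = "[\"]\\w+[\"]"

# Both spellings produce exactly the same pattern text.
print(single_quoted == escaped)  # True

# The pattern compiles without complaint either way.
compiled = re.compile(single_quoted)
```

The single-quoted raw string is usually the easier one to read.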

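However, this still does not match the provided quote. That is because \w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text) but does not match the comma (,), the dot (.), or whitespace. You can therefore change the pattern to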

quote_pattern ='["].*["]'
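A quick check on the example sentence shows the difference between the two patterns (the names word_only and anything are just for this sketch):

```python
import re

raw_text = 'Homer Simpson said: "Okay, here we go..."'

# \w+ cannot cross the comma after "Okay", so there is no match at all.
word_only = re.search(r'["]\w+["]', raw_text)

# .* accepts the comma, spaces and dots between the quotation marks.
anything = re.search(r'["].*["]', raw_text)

print(word_only)          # None
print(anything.group(0))  # "Okay, here we go..."
```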

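where .* matches any sequence of characters (except a newline, by default). You can further simplify the expression by removing the square brackets. They are not needed in this case because each contains only one element.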

quote_pattern ='".*"'
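One caveat that your example does not hit, but that is worth knowing: .* is greedy, so if a sentence contained more than one quoted part, it would run from the first quotation mark to the last one. The sketch below uses a made-up sentence and shows two common alternatives, the lazy .*? and the negated class [^"]*:

```python
import re

# Hypothetical input with two quoted parts (not from the question).
text = 'He said "yes" and then "no"'

# Greedy: spans from the first quotation mark to the last one.
greedy = re.search(r'"(.*)"', text).group(1)

# Lazy: stops at the first closing quotation mark.
lazy = re.search(r'"(.*?)"', text).group(1)

# Negated class: matches anything except a quotation mark.
negated = re.search(r'"([^"]*)"', text).group(1)

print(greedy)   # yes" and then "no
print(lazy)     # yes
print(negated)  # yes
```

For a single quotation per sentence, as in your assignment, the greedy version behaves the same as the other two.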

You need to return the quotation without the quotation marks. You can therefore create a capture group in the expression using ():

quote_pattern ='"(.*)"'

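This way the quotation marks are still required for a match, but a group is created that does not contain them. This group has index 1 instead of the 0 you are currently using: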

extracted_quotation = quote.group(1)
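To make the difference between the two group indices concrete, here is a short sketch on the example sentence (whole_match and inner_text are illustrative names):

```python
import re

raw_text = 'Homer Simpson said: "Okay, here we go..."'
quote = re.search(r'"(.*)"', raw_text)

# Group 0 is always the entire match, quotation marks included.
whole_match = quote.group(0)

# Group 1 is only what the parentheses captured.
inner_text = quote.group(1)

print(whole_match)  # "Okay, here we go..."
print(inner_text)   # Okay, here we go...
```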

This should yield the desired result.
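If you also want to cover the variations listed in your question (optional colon, and answered, responded or replied instead of said), one possible approach is a single combined pattern. This is only a sketch under my own assumptions: the name PATTERN and the combined expression are mine, it keeps your two-word name pattern, and it returns None for both values whenever the whole pattern is not found:

```python
import re

# Assumed combined pattern: two-word name, reporting verb,
# optional colon, then the quoted text in a capture group.
PATTERN = re.compile(
    r'(\w+\s\w+)\s+(?:said|answered|responded|replied):?\s*"(.*)"'
)


def person_quotation_pattern_extraction(raw_text):
    match = PATTERN.search(raw_text)
    if match is None:
        # Pattern not found: both values are None.
        return None, None
    # Group 1 is the name, group 2 the text between the quotation marks.
    return match.group(1), match.group(2)


print(person_quotation_pattern_extraction(
    'Homer Simpson said: "Okay, here we go..."'
))  # ('Homer Simpson', 'Okay, here we go...')
```

Names with more or fewer than two words would need a different name pattern, so adjust that part to whatever your inputs actually look like.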

Check out this website for some interactive regex experimentation: https://regex101.com/


python

nlp

pandas

data-science

python-re