Detection of quoted text in sentences

I have some sentences that contain quoted text, for example:

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?

I tried to mask the quoted parts with a regex, but it is not accurate. Take the last sentence, for example:

import re

txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))

The output is:

Reread these sentences: "<quote>" mean?

Instead, it should be:

Reread these sentences: "<quote>" What does the word "courtship" mean?

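Because I have more than 10k instances, it is hard to find a general regex pattern that works for all cases.

My question is: is there any library (perhaps one based on a neural network?) or method that solves this problem?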

For these examples, you can use:

import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)
print(txt)

See the Python proof. Use a separate command for each type of quotation mark; that makes the behavior easier to control.

Result:

Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say  <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
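
If the quotation marks themselves should stay in the output, as in the expected result shown in the question, the two substitutions can be folded into one helper. This is only a sketch: the names QUOTED, mask_quotes and min_len are made up for illustration, and the 20-character threshold is carried over from the patterns above.

```python
import re

# Straight and curly double quotes get separate alternatives, so an
# opening curly quote is only ever paired with a closing curly quote.
QUOTED = re.compile(r'"([^"]*)"|“([^“”]*)”')

def mask_quotes(text, min_len=20):
    """Mask the body of every quoted span that has at least min_len characters."""
    def repl(m):
        body = m.group(1) if m.group(1) is not None else m.group(2)
        if len(body) < min_len:
            return m.group()  # short quote: leave it untouched
        # Keep the original opening and closing quote characters.
        return m.group()[0] + "<quote>" + m.group()[-1]
    return QUOTED.sub(repl, text)

print(mask_quotes('Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'))
# Reread these sentences: "<quote>" What does the word "courtship" mean?
```

Because `[^"]*` cannot cross a quotation mark, each match stops at the nearest closing quote, which avoids the over-matching that `.{20,}` produces in the question.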

Another approach could be to use a tool entirely different from regular expressions, shlex:

The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages, (for example, in run control files for Python applications) or for parsing quoted strings.

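shlex.split takes quotes into account when splitting a string into words, and passing the optional argument posix=False keeps the quote characters in the result. Using its output, you can build a string like the one you describe.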
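
To see what the posix flag changes, here is a quick check on a made-up sentence (the sample line and the variable names are only for illustration):

```python
import shlex

line = 'Why were "the lights of his town" so small?'

# Default (posix=True): the quoted span is one token, but the quote
# characters are stripped from it.
posix_tokens = shlex.split(line)
print(posix_tokens)
# ['Why', 'were', 'the lights of his town', 'so', 'small?']

# posix=False: the quoted span is still one token and keeps its quotes,
# which is what makes it detectable afterwards.
raw_tokens = shlex.split(line, posix=False)
print(raw_tokens)
# ['Why', 'were', '"the lights of his town"', 'so', 'small?']
```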

import shlex

lines = [
'Why did the author use three sentences in a row that start with the words, "it spun"?',
'Why did the queen most likely say  “I would have tea instead.”',
'Why did the fdsfdsf repeat the phrase "he waited" so many times?',
'Why were "the lights of his town growing smaller below them"?',
'What is a fdsfdsf for the word "adjust"?',
'Reread this: "If anybody had asked trial of answered at once, \'My nose.\'" What is the correct definition of the word "trial" as it is used here?',
'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?',
]
for line in lines:
    print(
        " ".join(
            word
            if word[0] != '"' and word[-1] != '"' else '"<quote>"'
            for word in shlex.split(line, posix=False)
        )
    )

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "<quote>" as it is used here?
Reread these sentences: "<quote>" What does the word "<quote>" mean?
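  • Note 1: shlex does not interpret curly quotes as quotation marks (see line 2), so if your data contains curly quotes you should .replace() them with straight quotes before passing each line in.
  • Note 2: This replaces every quoted occurrence. If you only want to replace the first one and keep the rest, you can do this instead (I am fairly sure it could be written better, but take it as a proof of concept):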
for line in lines:
    new_line = []
    quote_count = 0
    for word in shlex.split(line, posix=False):
        if word[0] == '"' and word[-1] == '"':
            if quote_count < 1:
                quote_count += 1
                new_line.append('"<quote>"')
            else:
                new_line.append(word)
        else:
            new_line.append(word)
    print(' '.join(new_line))

Output:

Why did the author use three sentences in a row that start with the words, "<quote>" ?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "<quote>" so many times?
Why were "<quote>" ?
What is a fdsfdsf for the word "<quote>" ?
Reread this: "<quote>" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "<quote>" What does the word "courtship" mean?
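
The curly-quote workaround from Note 1 could look roughly like this. It is a sketch, not part of the original answer: CURLY_TO_STRAIGHT and mask_with_shlex are hypothetical names, and str.translate is used in place of two chained .replace() calls.

```python
import shlex

# Map curly double quotes to straight ones so shlex treats them as quotes.
CURLY_TO_STRAIGHT = str.maketrans({"“": '"', "”": '"'})

def mask_with_shlex(line):
    """Normalize curly quotes, then mask every quoted token."""
    tokens = shlex.split(line.translate(CURLY_TO_STRAIGHT), posix=False)
    return " ".join(
        '"<quote>"' if tok[0] == '"' and tok[-1] == '"' else tok
        for tok in tokens
    )

print(mask_with_shlex('Why did the queen most likely say  “I would have tea instead.”'))
# Why did the queen most likely say "<quote>"
```

Note that shlex.split raises ValueError when a line contains an unclosed quotation mark, so with 10k real sentences it is probably worth wrapping the call in a try/except and inspecting the lines that fail.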