Extracting quotations/citations with nltk (not regex)

A list of input sentences:

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]

Expected output:

How Doth the Little Busy Bee,
I'll try again.

Is there a way to extract the quotations (which may appear in either single or double quotes) with a built-in or third-party nltk tokenizer?


I tried using the SExprTokenizer, supplying the single and double quote characters as the parens value, but the result is far from what I expected, e.g.:

In [1]: from nltk import SExprTokenizer
    ...: 
    ...: 
    ...: sentences = [
    ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
    ...: ]
    ...: 
    ...: tokenizer = SExprTokenizer(parens='""', strict=False)
    ...: for sentence in sentences:
    ...:     for item in tokenizer.tokenize(sentence):
    ...:         print(item)
    ...:     print("----")
    ...:     
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
 but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'

There are posts like this and this, but they all suggest regex-based approaches. I'm curious whether this can be solved with nltk alone, since it sounds like a common task in natural language processing.

Well, under the hood SExprTokenizer is also a regex-based approach, as you can see from the source code you linked to.
The source also shows that the author apparently didn't consider the case where the opening and closing "paren" are the same character: the nesting depth is incremented and decremented in the same iteration, so the tokenizer sees the quotation as an empty string.
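Since SExprTokenizer is regex-based anyway, a plain regex is arguably the more honest tool here. The sketch below (an illustration, not part of nltk) works around the same-character delimiter problem by requiring that a delimiting quote not be glued to a word character, which keeps apostrophes like the ones in "I've" and "I'll" from being mistaken for quote boundaries:

```python
import re

# A quote character counts as a delimiter only when it is not adjacent
# to a word character, so apostrophes inside words (I've, I'll) are not
# mistaken for opening/closing quotes. The backreference \1 forces the
# closing character to match the opening one.
QUOTE_RE = re.compile(r'(?<!\w)([\'"])(.+?)\1(?!\w)')

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]

for sentence in sentences:
    for _quote_char, quoted in QUOTE_RE.findall(sentence):
        print(quoted)
# Prints:
# How Doth the Little Busy Bee,
# I'll try again.
```

This still breaks on nested quotations or a closing quote that abuts a word character, which is part of why robust quote extraction is harder than it looks.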

I don't think recognizing quotations is all that common in NLP. People use quotes in many different ways (especially once you deal with different languages), so it's hard to handle them robustly. For many NLP applications, quotations are simply ignored, I'd say...
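That said, nltk's Treebank tokenizer does at least normalize double quotes: an opening " becomes the token `` and a closing " becomes '', so paired quote tokens can be matched without the empty-string problem SExprTokenizer runs into. A sketch (covering double quotes only; single quotes are not rewritten this way):

```python
from nltk.tokenize import TreebankWordTokenizer

# TreebankWordTokenizer is pure regex, so no corpus downloads are needed.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize('She said "hello there" to me.')
print(tokens)  # the double quotes appear as `` and ''

# Collect the tokens between each ``/'' pair.
inside = []
quoted = None
for tok in tokens:
    if tok == '``':
        quoted = []
    elif tok == "''":
        inside.append(' '.join(quoted))
        quoted = None
    elif quoted is not None:
        quoted.append(tok)
print(inside)
```

Joining tokens with spaces loses the original spacing around punctuation, so this recovers the quote's words rather than its exact surface form.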