Extracting quotations/citations with nltk (not regex)

A list of input sentences:

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]

Expected output:

How Doth the Little Busy Bee,
I'll try again.

Is there a way to extract the quotations (which may appear in either single or double quotes) with a built-in or third-party nltk tokenizer?


I tried using the SExprTokenizer, supplying the single and double quote characters as the parens value, but the result is far from what I expected, e.g.:

In [1]: from nltk import SExprTokenizer
    ...: 
    ...: 
    ...: sentences = [
    ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
    ...: ]
    ...: 
    ...: tokenizer = SExprTokenizer(parens='""', strict=False)
    ...: for sentence in sentences:
    ...:     for item in tokenizer.tokenize(sentence):
    ...:         print(item)
    ...:     print("----")
    ...:     
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
 but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'

There are posts like this and this, but they all suggest regex-based approaches. I'm curious whether this can be solved with nltk alone, since it sounds like a common task in natural language processing.

Well, under the hood SExprTokenizer is also a regex-based approach, as you can see from the source code you linked to.
The source also shows that the author apparently didn't consider the case where the opening and closing "paren" are the same character: the nesting depth is incremented and decremented in the same iteration, so the tokenizer sees the quotation as an empty string.
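Since SExprTokenizer is regex-based anyway, a plain regex is arguably the more honest tool here. The sketch below (an illustration, not part of nltk) works around the same-character delimiter problem by requiring that a delimiting quote not be glued to a word character, which keeps apostrophes like the ones in "I've" and "I'll" from being mistaken for quote boundaries:

```python
import re

# A quote character counts as a delimiter only when it is not adjacent
# to a word character, so apostrophes inside words (I've, I'll) are not
# mistaken for opening/closing quotes. The backreference \1 forces the
# closing character to match the opening one.
QUOTE_RE = re.compile(r'(?<!\w)([\'"])(.+?)\1(?!\w)')

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]

for sentence in sentences:
    for _quote_char, quoted in QUOTE_RE.findall(sentence):
        print(quoted)
# Prints:
# How Doth the Little Busy Bee,
# I'll try again.
```

This still breaks on nested quotations or a closing quote that abuts a word character, which is part of why robust quote extraction is harder than it looks.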

I don't think recognizing quotations is all that common in NLP. People use quotes in many different ways (especially once you deal with different languages), so it's hard to handle them robustly. For many NLP applications, quotations are simply ignored, I'd say...
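That said, nltk's Treebank tokenizer does at least normalize double quotes: an opening " becomes the token `` and a closing " becomes '', so paired quote tokens can be matched without the empty-string problem SExprTokenizer runs into. A sketch (covering double quotes only; single quotes are not rewritten this way):

```python
from nltk.tokenize import TreebankWordTokenizer

# TreebankWordTokenizer is pure regex, so no corpus downloads are needed.
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize('She said "hello there" to me.')
print(tokens)  # the double quotes appear as `` and ''

# Collect the tokens between each ``/'' pair.
inside = []
quoted = None
for tok in tokens:
    if tok == '``':
        quoted = []
    elif tok == "''":
        inside.append(' '.join(quoted))
        quoted = None
    elif quoted is not None:
        quoted.append(tok)
print(inside)
```

Joining tokens with spaces loses the original spacing around punctuation, so this recovers the quote's words rather than its exact surface form.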