Extracting quotations/citations with nltk (not regex)
Input list of sentences:
sentences = [
"""Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
"""Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]
Desired output:
How Doth the Little Busy Bee,
I'll try again.
Is there a way to extract quotations (which may appear in single as well as double quotes) with nltk, via a built-in or third-party tokenizer? I tried using the SExprTokenizer, supplying the single and double quote characters as the parens value, but the result was far from the desired one, e.g.:
In [1]: from nltk import SExprTokenizer
...:
...:
...: sentences = [
...: """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
...: """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
...: ]
...:
...: tokenizer = SExprTokenizer(parens='""', strict=False)
...: for sentence in sentences:
...: for item in tokenizer.tokenize(sentence):
...: print(item)
...: print("----")
...:
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'
There are posts like this and this, but all of them suggest regex-based approaches. Still, I'm curious whether this can be solved with nltk alone, since it sounds like a fairly common task in natural language processing.
Well, under the hood, SExprTokenizer is a regex-based approach too, as can be seen from the source code you linked to.
What is also apparent from the source is that the author didn't consider the case where the opening and closing "paren" are represented by the same character: the nesting depth is incremented and decremented within the same iteration, so the quotation the tokenizer sees is the empty string.
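If you do need the quoted spans for inputs like these, a tiny hand-rolled scanner is enough. Below is a minimal sketch (plain Python, no nltk; extract_quotes and its apostrophe heuristic are my own assumptions, not any standard API) that toggles in and out of a quote span instead of tracking nesting depth, which sidesteps the same-character problem described above:

# Minimal sketch (not nltk): toggle in/out of a quote span instead of
# tracking nesting depth, since the opening and closing character are
# identical. extract_quotes and its apostrophe heuristic are assumptions,
# not a standard API.
def extract_quotes(text, quote_chars='"\''):
    quotes = []
    open_char = None   # quote character we are currently inside, if any
    start = None       # index just after the opening quote
    for i, ch in enumerate(text):
        if ch not in quote_chars:
            continue
        if open_char is None:
            # Heuristic: a single quote only opens a span after whitespace
            # (or at the start of the text), so apostrophes as in "I've"
            # are ignored.
            if ch == "'" and i > 0 and not text[i - 1].isspace():
                continue
            open_char, start = ch, i + 1
        elif ch == open_char:
            # Same heuristic on the closing side: a closing single quote
            # should not be immediately followed by a letter ("I'll").
            if ch == "'" and i + 1 < len(text) and text[i + 1].isalpha():
                continue
            quotes.append(text[start:i])
            open_char = None
    return quotes

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]
for sentence in sentences:
    for quote in extract_quotes(sentence):
        print(quote)

On the two input sentences this prints the desired "How Doth the Little Busy Bee," and "I'll try again." lines, but the apostrophe heuristic is deliberately naive and will misfire on quotes that open or close directly next to letters.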
That said, I don't think identifying quotations is all that common in NLP.
People use quotes in many different ways (especially once you deal with different languages...), so it is hard to handle them correctly in a robust fashion.
For many NLP applications, quotations are simply ignored, I'd say...