Preserve empty lines with NLTK's Punkt Tokenizer

I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I would like to preserve the empty lines within the file:

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But what actually gets printed shows that the trailing empty lines have been removed from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

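Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option for the Punkt tokenizer. It's quite possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insight others can provide!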

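The problem

Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

Starting from the tokenizer's entry point in nltk.tokenize.punkt and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition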

if match.group('next_tok'):

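that is meant to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this condition relies on, we end up at _period_context_fmt, where the next_tok named group is preceded by \s+. That whitespace sits inside a lookahead, so the blank lines are checked but never captured as part of the sentence.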
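
As a rough illustration of why the whitespace is lost, here is a simplified stand-in for that pattern using only the re module. The hard-coded character classes are assumptions made for this demo; the real pattern is assembled from PunktLanguageVars attributes.

```python
import re

# Simplified stand-in for Punkt's period-context regex (assumption: the
# sentence-ending and punctuation classes are hard-coded for this demo).
original_like = re.compile(r"""
    \S*                        # some word material
    [.?!]                      # a potential sentence ending
    (?=(?P<after_tok>
        [?!)]                  # either other punctuation
        |
        \s+(?P<next_tok>\S+)   # or whitespace and some other token
    ))""", re.VERBOSE)

text = "That was a very loud beep.\n\n I don't even know"
match = original_like.search(text)

# The lookahead sees the blank line, but the match itself stops at the period.
matched_text = match.group()
next_token = match.group("next_tok")
print(repr(matched_text), repr(next_token))
```

The match ends right after the period, while next_tok already points at the first token of the following sentence, so everything in between is skipped.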

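The solution

Break it down, change the part you don't like, and reassemble your custom solution.

This regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it to be.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing _period_context_fmt, going from this: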

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                       #  <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
    ))"""

Now a tokenizer using this regex instead of the old one will include zero or more \s characters after the end of a sentence.
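
To check the effect of moving \s* out of the lookahead without involving NLTK, here is a minimal sketch with a simplified stand-in pattern. The hard-coded character classes are assumptions for this demo, not the values PunktLanguageVars actually substitutes.

```python
import re

# Simplified stand-in for the modified pattern (assumption: hard-coded
# sentence-ending and punctuation classes instead of Punkt's own).
modified_like = re.compile(r"""
    \S*                        # some word material
    [.?!]                      # a potential sentence ending
    \s*                        # trailing whitespace is now consumed
    (?=(?P<after_tok>
        [?!)]                  # either other punctuation
        |
        (?P<next_tok>\S+)      # or some other token
    ))""", re.VERBOSE)

text = "That was a very loud beep.\n\n I don't even know"
match = modified_like.search(text)

# The whitespace after the period is now part of the match itself.
matched_text = match.group()
next_token = match.group("next_tok")
print(repr(matched_text), repr(next_token))
```

Because the whitespace is consumed before the lookahead, the blank line and the following space end up attached to the sentence that precedes them.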

The whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
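
Note that some sentences in this output carry a trailing space that the desired output in the question does not have. If that matters, one possible post-processing step is to strip only space characters from the right and leave the newlines alone. This is a sketch that is not part of the original answer, and strip_trailing_spaces is a hypothetical helper name.

```python
def strip_trailing_spaces(sentences):
    """Remove trailing space characters but keep trailing newlines."""
    return [sentence.rstrip(" ") for sentence in sentences]

# The list printed by the custom tokenizer above.
tokenized = [
    'That was a very loud beep.\n\n ',
    "I don't even know\n if this is working. ",
    'Mark?\n\n ',
    'Mark are you there?\n\n\n',
]
cleaned = strip_trailing_spaces(tokenized)
print(cleaned)
```

Passing " " to rstrip() limits the stripping to spaces, so a sentence ending in "\n\n " keeps its two newlines.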

I would go with itertools.groupby, see "Python: How to loop through blocks of lines":

alvas@ubi:~$ echo """This is a foo bar sentence,
that is also a foo bar sentence.

But I don't like foobars.
Yes you do like bars with foos, no?


I'm not sure whether you like bar bar!
Neither do I like black sheep.""" > test.in



alvas@ubi:~$ python
>>> from nltk import sent_tokenize
>>> import itertools
>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x != '\n'):
...         if key:
...             print(list(group))
... 
['This is a foo bar sentence,\n', 'that is also a foo bar sentence.\n']
["But I don't like foobars.\n", 'Yes you do like bars with foos, no?\n']
["I'm not sure whether you like bar bar!\n", 'Neither do I like black sheep.\n']

After that, if you want to run sent_tokenize or another punkt model within each group:

>>> with open('test.in', 'r') as fin:
...     for key, group in itertools.groupby(fin, lambda x: x != '\n'):
...         if key:
...             paragraph = " ".join(line.strip() for line in group)
...             print(sent_tokenize(paragraph))
... 
['This is a foo bar sentence, that is also a foo bar sentence.']
["But I don't like foobars.", 'Yes you do like bars with foos, no?']
["I'm not sure whether you like bar bar!", 'Neither do I like black sheep.']

(Note: a more computationally efficient approach would be to use mmap. But for the size I work with (~20 million tokens), itertools.groupby was sufficient.)
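
Since the question is about keeping the empty lines, here is a small variation on the groupby idea that keeps the blank runs instead of discarding them. It is a sketch under two assumptions that differ from the session above: whitespace-only lines are treated as blank, and split_blocks is a hypothetical helper name. An in-memory io.StringIO stands in for the file.

```python
import io
import itertools

def split_blocks(lines):
    """Yield (is_text, block) pairs, alternating text runs and blank-line runs."""
    for is_text, group in itertools.groupby(lines, lambda line: line.strip() != ""):
        yield is_text, "".join(group)

# In-memory stand-in for test.in, shortened for the example.
sample = io.StringIO(
    "This is a foo bar sentence,\n"
    "that is also a foo bar sentence.\n"
    "\n"
    "But I don't like foobars.\n"
    "\n"
    "\n"
    "Neither do I like black sheep.\n"
)
blocks = list(split_blocks(sample))
for is_text, block in blocks:
    print(is_text, repr(block))
```

The text blocks can then be sentence-tokenized one at a time, and the blank blocks can be appended to whichever neighbouring sentence suits your use.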

Split the input into paragraphs, splitting on a capturing regexp (which returns the captured separator strings as well):

paras = re.split(r"(\n\s*\n)", sentences)
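
To see what the capturing group does, here is a minimal example on a made-up string (the sample text is an assumption for illustration only):

```python
import re

text = "First para. Still first.\n\nSecond para.\n\n\nThird."

# Because the pattern is wrapped in a capturing group, re.split() returns
# the blank-line separators as list items between the paragraphs.
paras = re.split(r"(\n\s*\n)", text)
print(paras)
```

Every odd-indexed item is a separator, so joining the list back together reproduces the original text exactly.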

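Then you can apply nltk.sent_tokenize() to the individual paragraphs, and either process the results paragraph by paragraph or flatten the list, whichever best suits your further use.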

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ]
flat = [ sent for par in sents_by_para for sent in par ]

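(It seems sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check for them and exclude them from processing.)

If what you specifically want is to have the whitespace attached to the previous sentence, you can easily stick it back on: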

collapsed = []
for s in flat:
    if s.isspace() and len(collapsed) > 0:
        collapsed[-1] += s
    else:
        collapsed.append(s)
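
As a worked example of that loop, here it is wrapped in a function and run on a hand-written list. Both the function name collapse_whitespace and the sample list are assumptions for illustration; the list mimics sentences interleaved with their blank-line separators.

```python
def collapse_whitespace(items):
    """Append each whitespace-only item to the item that precedes it."""
    collapsed = []
    for item in items:
        if item.isspace() and collapsed:
            collapsed[-1] += item
        else:
            collapsed.append(item)
    return collapsed

# Hand-written stand-in for a flattened list of sentences and separators.
flat = [
    "That was a very loud beep.", "\n\n",
    "I don't even know\n if this is working.", "Mark?", "\n\n",
    "Mark are you there?", "\n\n\n",
]
result = collapse_whitespace(flat)
print(result)
```

A whitespace-only item at the very start of the list has nothing to attach to, so it is kept as its own item.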

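In the end, I combined the insights from @alexis and @HugoMailhot so that I could preserve line breaks in cases where a single paragraph has multiple sentences and/or line breaks: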

import re, nltk, sys, codecs
import nltk.tokenize.punkt as pkt
from nltk import data

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

custom_tokenizer = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

def sentence_split(text):
    '''Read in a string and return a list of sentences with linebreaks intact'''
    # The capturing group makes re.split() return the blank-line separators too
    paras = re.split(r"(\n\s*\n)", text)
    sents_by_para = [custom_tokenizer.tokenize(p) for p in paras]
    flat = [sent for par in sents_by_para for sent in par]

    # Glue each whitespace-only item onto the sentence before it
    collapsed = []
    for sent in flat:
        if sent.isspace() and len(collapsed) > 0:
            collapsed[-1] += sent
        else:
            collapsed.append(sent)

    return collapsed

if __name__ == "__main__":
    # Read the file named on the command line as UTF-8
    with codecs.open(sys.argv[1], 'r', 'utf-8') as fin:
        text = fin.read()
    sentences = sentence_split(text)