How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

Question

I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to actually get back a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.

When I feed my text into texttiling, I get the same untokenized text back, only wrapped in a list, which is of no use to me.

    import nltk

    tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

    tiles = tt.tokenize(text)  # same text returned, wrapped in a list

What I have are emails that follow this basic structure:

    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL

If we call this email string s, it looks like this:

    s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"

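What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

*** Not all emails follow the same structure or use the same wording, so I can't use regular expressions.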

Answer 1

How about using splitlines? Or do you have to use the nltk package?

    email = """    From: X
        To: Y                             (LOGISTICS)
        Date: 10/03/2017

        Hello team,                       (INTRO)

        Some text here representing
        the body                          (BODY)
        of the text.

        Regards,                          (OUTRO)
        X

        *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
        THIS EMAIL IS CONFIDENTIAL
        IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

    y = [s.strip() for s in email.splitlines()]

    print(y)
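Note that splitlines returns individual lines, not paragraphs. If the blank lines between sections survive in your emails, a minimal sketch like the one below groups the lines into paragraphs by splitting on blank lines. The helper name `split_paragraphs` is made up for illustration, and the approach assumes blank-line separators; it would not work on the flattened string s from the question, where some section breaks are missing.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and strip each line (hypothetical helper)."""
    # One or more blank (or whitespace-only) lines separate two paragraphs.
    chunks = re.split(r"\n\s*\n", text.strip())
    return ["\n".join(line.strip() for line in chunk.splitlines()) for chunk in chunks]


email = """From: X
To: Y
Date: 10/03/2017

Hello team,

Some text here representing
the body
of the text.

Regards,
X

*****DISCLAIMER*****
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

paragraphs = split_paragraphs(email)
for index, paragraph in enumerate(paragraphs):
    print(index, repr(paragraph))
```

With this particular layout the body is the third paragraph (`paragraphs[2]`), but that position is an assumption about the example, not something that holds for every email.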

Answer 2

What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

The TextTiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:

TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
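To illustrate why this does not map onto labelling email parts, here is a toy sketch of the block-comparison idea only: it scores each gap between fixed-size token windows by the cosine similarity of the word counts on either side, so a low score hints at a vocabulary shift. This is not NLTK's implementation, which also removes stopwords, smooths the scores, and derives boundaries from depth scores; the names `cosine` and `gap_scores` and the tiny values of `w` and `k` are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two Counters of token counts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(text, w=5, k=2):
    """Score every gap between pseudosentences of w tokens.

    Each score compares the k pseudosentences before the gap with the
    k pseudosentences after it.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    pseudo = [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
    scores = []
    for gap in range(1, len(pseudo)):
        left = sum(pseudo[max(0, gap - k):gap], Counter())
        right = sum(pseudo[gap:gap + k], Counter())
        scores.append(cosine(left, right))
    return scores


topic_a = "cats purr cats sleep cats purr cats sleep cats purr"
topic_b = "stocks fall stocks rise stocks fall stocks rise stocks fall"
scores = gap_scores(topic_a + " " + topic_b)

# The middle gap scores 0.0 because the two halves share no words.
print([round(score, 2) for score in scores])
```

The scores only reflect how much vocabulary neighbouring stretches of text share. Nothing in them says whether a segment is a header, a greeting, or a disclaimer, which is what a classifier over sections would have to decide.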

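References:

{1} Hearst, M. A. Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
{2} Lee, J. Y. and Dernoncourt, F. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 515-520, June 2016. https://www.aclweb.org/anthology/N16-1062.pdf
{3} Dernoncourt, F., Lee, J. Y., and Szolovits, P. Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700, 2017. https://www.aclweb.org/anthology/E17-2110.pdf
{4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
{5} Pevzner, L. and Hearst, M. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), pp. 19-36, March 2002.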

