How to split text into paragraphs using NLTK nltk.tokenize.texttiling?

I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to actually get back a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html

When I feed my text into texttiling, I get the same untokenized text back, only wrapped in a list, which is of no use to me.

    import nltk

    tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)

    tiles = tt.tokenize(text)  # same text returned

What I have are emails that follow this basic structure:

    From: X
    To: Y                             (LOGISTICS)
    Date: 10/03/2017

    Hello team,                       (INTRO)

    Some text here representing
    the body                          (BODY)
    of the text.

    Regards,                          (OUTRO)
    X

    *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
    THIS EMAIL IS CONFIDENTIAL
    IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL

If we call this email string s, it would look like this:

    s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"

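What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

*** Not all emails follow the same structure or have the same wording, so I can't use regular expressions.

What about using splitlines? Or do you have to use the nltk package?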

    email = """    From: X
        To: Y                             (LOGISTICS)
        Date: 10/03/2017

        Hello team,                       (INTRO)

        Some text here representing
        the body                          (BODY)
        of the text.

        Regards,                          (OUTRO)
        X

        *****DISCLAIMER*****              (POST EMAIL DISCLAIMER)
        THIS EMAIL IS CONFIDENTIAL
        IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

    # Split into lines and strip surrounding whitespace; blank lines become ''
    y = [s.strip() for s in email.splitlines()]

    print(y)
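The splitlines call above yields individual lines, not sections. One possible next step, sketched here under the assumption that blank lines separate the sections (which the question says is not guaranteed for every email), is to group consecutive non-blank lines into paragraphs. The helper name `split_paragraphs` and the `sample` string are made up for this illustration.

```python
def split_paragraphs(text):
    """Group consecutive non-blank lines into paragraphs.

    Assumption: sections are separated by at least one blank line.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical sample that mirrors the email layout from the question.
sample = """From: X
To: Y
Date: 10/03/2017

Hello team,

Some text here representing
the body
of the text.

Regards,
X

*****DISCLAIMER*****
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""

paragraphs = split_paragraphs(sample)
for index, paragraph in enumerate(paragraphs):
    print(index, repr(paragraph))
```

For this particular sample the body ends up at index 2, but picking a section by position only works when every email has the same layout.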

What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?

The TextTiling algorithm {1,4,5} is not designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:

TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
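To make the distinction concrete, here is a heavily simplified sketch of the lexical-comparison idea behind TextTiling. It is not NLTK's implementation: it treats whole paragraphs as blocks and simply picks the gap with the lowest cosine similarity, whereas the real algorithm works on fixed-size token windows with smoothing and depth scores. The function names and the sample paragraphs are invented for the illustration.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(paragraphs):
    """Lexical similarity across each gap between adjacent paragraphs."""
    bags = [Counter(re.findall(r"[a-z]+", p.lower())) for p in paragraphs]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


# Hypothetical text with one clear change of topic in the middle.
paragraphs = [
    "The cat sat on the mat. The cat purred.",
    "A cat likes the warm mat in the sun.",
    "Stock prices fell as markets reacted to interest rates.",
    "Markets and stock prices recovered after rates were cut.",
]

scores = gap_similarities(paragraphs)
# The gap with the least shared vocabulary is the most likely topic boundary.
boundary = min(range(len(scores)), key=scores.__getitem__)
print(scores, boundary)
```

A boundary here comes purely from a shift in vocabulary. Nothing in the method knows what a header, greeting, or sign-off is, which is why labelling email sections is closer to a classification task than to topic segmentation.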


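References:

  • {1} Hearst, M. A. Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
  • {2} Lee, J. Y. and Dernoncourt, F. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 515-520, June 2016. https://www.aclweb.org/anthology/N16-1062.pdf
  • {3} Dernoncourt, F., Lee, J. Y., and Szolovits, P. Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700, 2017. https://www.aclweb.org/anthology/E17-2110.pdf
  • {4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
  • {5} Pevzner, L. and Hearst, M. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), pp. 19-36, March 2002.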