How to split text into paragraphs using NLTK nltk.tokenize.texttiling?
I found the question "Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling?", which explains how to feed a text into texttiling. However, I am unable to actually return a text tokenized by paragraph / topic change, as shown under texttiling at http://www.nltk.org/api/nltk.tokenize.html.
When I feed my text into texttiling, I get the same untokenized text back, just wrapped in a list, which is of no use to me.
import nltk
tt = nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)
tiles = tt.tokenize(text)  # same text returned, as a one-element list
What I have are emails that follow this basic structure:
From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL
If we call this email string s, it looks like:
s = "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\nSome text here representing the body of the text. Regards,\nX\n\n*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\nIF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
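What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
*** Not all emails follow the same structure or have the same wording, so I can't use regular expressions.
What about using splitlines? Or do you have to use the nltk package?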
email = """ From: X
To: Y (LOGISTICS)
Date: 10/03/2017
Hello team, (INTRO)
Some text here representing
the body (BODY)
of the text.
Regards, (OUTRO)
X
*****DISCLAIMER***** (POST EMAIL DISCLAIMER)
THIS EMAIL IS CONFIDENTIAL
IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"""
y = [s.strip() for s in email.splitlines()]
print(y)
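splitlines returns one entry per line, not one per section. If the emails separate sections with blank lines, a split on blank lines gets closer to paragraphs. The following is only a sketch under that assumption; the helper name split_on_blank_lines is made up for illustration, and the string s is the one from the question, which contains a single blank line (before the disclaimer).

```python
import re


def split_on_blank_lines(text):
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block.strip() for block in blocks if block.strip()]


# The example string from the question.
s = (
    "From: X\nTo: Y\nDate: 10/03/2017 Hello team,\n"
    "Some text here representing the body of the text. Regards,\nX\n\n"
    "*****DISCLAIMER*****\nTHIS EMAIL IS CONFIDENTIAL\n"
    "IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL"
)

blocks = split_on_blank_lines(s)
for i, block in enumerate(blocks):
    print(i, repr(block))
```

On this string the result is two blocks, the email itself and the disclaimer, so this only helps as far as the blank lines in the real emails line up with the sections you care about.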
What I want to do is return these 5 sections / paragraphs of string s - LOGISTICS, INTRO, BODY, OUTRO, POST EMAIL DISCLAIMER - separately so I can remove everything but the BODY of the text. How can I return these 5 sections separately using nltk texttiling?
The texttiling algorithm {1,4,5} isn't designed to perform sequential text classification {2,3} (which is the task you described). Instead, from http://people.ischool.berkeley.edu/~hearst/research/tiling.html:
TextTiling is [an unsupervised] technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
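To make the difference concrete: sequential classification assigns a label to every line (or sentence), taking the labels of the preceding lines into account. The toy below illustrates only that framing, using hand-written rules and hypothetical label names; it is not the neural approach of {2,3}, and rules like these will not generalize across differently worded emails. In practice the rules would be replaced by a classifier trained on labeled emails.

```python
def label_lines(lines):
    """Assign one section label per line, using the previous label as context.

    Toy, hand-written rules for illustration only.
    """
    labels = []
    state = "LOGISTICS"  # assume every email starts with header lines
    for line in lines:
        lower = line.strip().lower()
        if "disclaimer" in lower:
            state = "DISCLAIMER"
        elif state == "LOGISTICS" and not lower.startswith(("from:", "to:", "date:")):
            state = "INTRO"  # first non-header line is treated as the greeting
        elif state == "INTRO":
            state = "BODY"
        elif state == "BODY" and lower.startswith(("regards", "thanks", "best")):
            state = "OUTRO"
        labels.append(state)
    return labels


email_lines = [
    "From: X",
    "To: Y",
    "Date: 10/03/2017",
    "Hello team,",
    "Some text here representing",
    "the body",
    "of the text.",
    "Regards,",
    "X",
    "*****DISCLAIMER*****",
    "THIS EMAIL IS CONFIDENTIAL",
    "IF YOU ARE NOT THE INTENDED RECIPIENT PLEASE DELETE THIS EMAIL",
]

labels = label_lines(email_lines)
body = " ".join(l for l, tag in zip(email_lines, labels) if tag == "BODY")
print(body)
```

Once each line carries a label, keeping only the BODY is a simple filter, which is why the per-line labeling framing fits this task better than unsupervised topic segmentation.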
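References:
- {1} Marti A. Hearst. Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
- {2} Lee, J.Y. and Dernoncourt, F. Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 515-520, June 2016. https://www.aclweb.org/anthology/N16-1062.pdf
- {3} Dernoncourt, F., Lee, J.Y. and Szolovits, P. Neural Networks for Joint Sentence Classification in Medical Paper Abstracts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 694-700, 2017. https://www.aclweb.org/anthology/E17-2110.pdf
- {4} Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
- {5} Pevzner, L. and Hearst, M. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), pp. 19-36, March 2002.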