Extracting utterance per line in Python

Question

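I have text data with one utterance per line. I want to extract them into a list that contains all the utterances, so that the list has as many items as the file has lines.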

Here is a sample of my data, input.txt:

I am very happy today.
Are you angry with me...? No?
Oh my dear, you look so beautiful.
Let's take a rest, I am so tired. 
Excuse me. This is my fault.

Currently, I use the following Python code:

import numpy as np
from nltk import tokenize

utterances = []
with open('input.txt', 'r') as myfile:
    for line in myfile.readlines():
        # sent_tokenize splits each line into sentences
        utterance = tokenize.sent_tokenize(line)
        utterances = np.append(utterances, utterance)
utterances = list(utterances)
len(utterances)

It gives a total of 7 utterances, but the count should match the input data, which has 5 lines.

I expect the following output (a list of 5 utterances):

['I am very happy today.', 'Are you angry with me...? No?', 'Oh my dear, you look so beautiful.', "Let's take a rest, I am so tired.", 'Excuse me. This is my fault.']

However, the Python code above produces the following output (7 sentences):

['I am very happy today.', 'Are you angry with me...?', 'No?', 'Oh my dear, you look so beautiful.', "Let's take a rest, I am so tired.", 'Excuse me.', 'This is my fault.']

Is there anything in NLTK better than tokenize.sent_tokenize? I think it is the reason I am getting the wrong result.

Answer 1

You can simply append each line to a list, with no need for np.append() or sent_tokenize:

utterances = []
with open('input.txt', 'r') as myfile:
    for line in myfile:
        # Drop the trailing newline; each line is one utterance
        utterance = line.strip('\n')
        utterances.append(utterance)
print(utterances)
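If the file may contain blank lines or stray trailing spaces, a slightly more defensive variant could look like the sketch below. The helper name `read_utterances` and the UTF-8 encoding are assumptions made for illustration, and the demo writes a throwaway file instead of reading your real input.txt.

```python
import tempfile
from pathlib import Path


def read_utterances(path):
    """Return one utterance per non-empty line of the file at `path`."""
    # Assumption: the file is UTF-8 encoded.
    text = Path(path).read_text(encoding="utf-8")
    # strip() removes the newline and surrounding spaces; blank lines are skipped.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo on a temporary file that includes a blank line and a trailing space.
sample = (
    "I am very happy today.\n"
    "Are you angry with me...? No?\n"
    "\n"
    "Excuse me. This is my fault. \n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "input.txt"
    demo_path.write_text(sample, encoding="utf-8")
    utterances = read_utterances(demo_path)

print(utterances)
```

With your own data you would call `read_utterances('input.txt')`. Whether to skip blank lines is a design choice: if an empty line should count as an empty utterance, drop the `if line.strip()` filter.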

Answer 2

With this line

utterance = tokenize.sent_tokenize(line)

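you are asking nltk to tokenize your data into sentences, not utterances. This function treats ? and . as marking the end of a sentence. Two lines of your data contain more than one sentence terminator, so the tokenizer splits each of them into two sentences. That is why your result contains 7 sentences (not 8 as you reported): lines 2 and 5 are each split into two sentences.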
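To see where the two extra items come from without installing NLTK, here is a rough sketch that compares line splitting with a naive sentence splitter. The regex in `naive_sentences` is a simplified stand-in that I am assuming for illustration; it is not the algorithm sent_tokenize uses, it just happens to split this sample in the same places.

```python
import re

text = """I am very happy today.
Are you angry with me...? No?
Oh my dear, you look so beautiful.
Let's take a rest, I am so tired.
Excuse me. This is my fault."""


def naive_sentences(line):
    """Split a line on whitespace that follows '.', '?' or '!' (illustrative only)."""
    parts = re.split(r"(?<=[.?!])\s+", line.strip())
    return [part for part in parts if part]


# One utterance per line: this is what the question actually wants.
lines = text.splitlines()

# Sentence-level splitting: lines 2 and 5 each become two items.
sentences = [s for line in lines for s in naive_sentences(line)]

print(len(lines))      # 5
print(len(sentences))  # 7
```

The count differs by exactly two, one for each line that holds two sentences, which is why splitting by line is the right tool when each line is one utterance.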

python

text-processing

nltk