Python NLTK - Tokenize paragraphs into sentences and words

I have some paragraphs of text in a .txt file. I'm trying to tokenize the paragraphs and append them to a list of sentences and a list of words. I'm not sure what I'm doing wrong, since I've managed to get the sentences but not the words. I've been banging my head against the wall over this!

Input:

This is sentence one,
Another sentence:
Third line.

Desired output:

[
 ['This', 'is', 'sentence', 'one', ','],
 ['Another', 'sentence', ':'],
 ['Third', 'line', '.']
]

My erroneous code and output:

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))  # appends a *list* of sentences per line

sentences_split_into_words = []
for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.append(words_token)

Result:

[
 [['This', 'is', 'sentence', 'one', ',']],
 [['Another', 'sentence', ':']],
 [['Third', 'line', '.']]
]

I also tried the following, but it raises the error 'expected string or bytes-like object':

for line in sentences:
    sentences_split_into_words.append(word_tokenize(line))
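That error happens because sent_tokenize returns a list of sentences, so each element appended to sentences is itself a list, while word_tokenize only accepts a string. A minimal reproduction of the mistake:

from nltk.tokenize import sent_tokenize, word_tokenize

line_sentences = sent_tokenize("Third line.")  # ['Third line.'] -- a list, not a string
word_tokenize(line_sentences)                  # TypeError: expected string or bytes-like object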

Try this code. The fix is to use extend instead of append when collecting the per-sentence token lists: append inserts the whole list as a single element (hence the extra level of nesting in your result), while extend merges its items into the outer list, giving the desired flat structure.

from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

sentences_split_into_words = []
for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    # extend, not append: merge the token lists instead of nesting them
    sentences_split_into_words.extend(words_token)

Reference: https://www.programiz.com/python-programming/methods/list/extend
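For completeness, the two loops can also be collapsed into one pass that never builds the nested intermediate list. This is a minimal sketch under the same assumptions as above (a file.txt in the working directory and the NLTK punkt sentence model available, e.g. via nltk.download('punkt')):

from nltk.tokenize import sent_tokenize, word_tokenize

# Split each line into sentences, then each sentence into words,
# collecting a flat list of token lists in a single comprehension.
with open('file.txt') as file:
    tokenized = [word_tokenize(sent)
                 for line in file
                 for sent in sent_tokenize(line)]

print(tokenized)
# [['This', 'is', 'sentence', 'one', ','],
#  ['Another', 'sentence', ':'],
#  ['Third', 'line', '.']]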