Python NLTK - Tokenize paragraphs into sentences and words
I have some paragraphs of text in a .txt file. I'm trying to tokenize the paragraphs and append them to lists of sentences and words. I'm not sure what I'm doing wrong, since I've managed to get the sentences but not the words. I've been banging my head against the wall over this!
Input:
This is sentence one,
Another sentence:
Third line.
Desired output:
[
['This', 'is', 'sentence', 'one', ','],
['Another', 'sentence', ':'],
['Third', 'line', '.']
]
My faulty code and its output:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
sentences_split_into_words = []

with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.append(words_token)
----Result----
[
[['This', 'is', 'sentence', 'one', ',']],
[['Another', 'sentence', ':']],
[['Third', 'line', '.']]
]
I also tried the following, but it raises the error 'expected string or bytes-like object':
for line in sentences:
    sentences_split_into_words.append(word_tokenize(line))
Try this code. Your first attempt triple-nests the result because words_token is itself a list of token lists, and append adds it as a single element; extend splices its items in at the top level instead. Your second attempt fails because after sent_tokenize each item in sentences is a list, not a string, so word_tokenize raises 'expected string or bytes-like object'.
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = []
with open('file.txt') as file:
    for line in file:
        sentences.append(sent_tokenize(line))

sentences_split_into_words = []
for line in sentences:
    words_token = [word_tokenize(i) for i in line]
    sentences_split_into_words.extend(words_token)
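For context, the whole fix comes down to append vs. extend on a plain list; a minimal illustration:

tokens = ['This', 'is', 'sentence', 'one', ',']

nested = []
nested.append(tokens)   # append adds the whole list as a single element
print(nested)           # [['This', 'is', 'sentence', 'one', ',']]

flat = []
flat.extend(tokens)     # extend splices the items in one by one
print(flat)             # ['This', 'is', 'sentence', 'one', ',']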
Reference: https://www.programiz.com/python-programming/methods/list/extend
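As a side note, if sentences can span line breaks, a shorter variant (a sketch, assuming the file fits in memory) is to read the whole file once and let sent_tokenize do the splitting:

from nltk.tokenize import sent_tokenize, word_tokenize

# Read the whole file at once so sentences that cross line breaks
# are still detected (assumes the file fits in memory).
with open('file.txt') as file:
    text = file.read()

sentences_split_into_words = [word_tokenize(s) for s in sent_tokenize(text)]
print(sentences_split_into_words)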