Creating tokens from a list of sentences is returning characters instead of words
import itertools
from nltk.tokenize import sent_tokenize

text = open(path).read().lower().decode("utf8")
sent_tokenize_list = sent_tokenize(text)
tokens = [w for w in itertools.chain(*[sent for sent in sent_tokenize_list])]
The last line, tokens, returns characters instead of words.
Why does this happen, and how can I make it return words instead, especially given that I want to build the tokens from the list of sentences?
Because sent_tokenize returns a list of sentence strings, and itertools.chain chains iterables together into a single iterable, yielding their items one at a time until they are exhausted. Since the items of a string are its characters, you have in effect recombined the sentences into one long string and are iterating over it character by character in the list comprehension.
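A minimal sketch of what happens, using a made-up two-sentence list (the variable names and strings here are only for illustration):

import itertools

sentences = ["the cat sat.", "the dog ran."]
# Chaining strings together iterates over their characters, one at a time:
print(list(itertools.chain(*sentences))[:6])
# ['t', 'h', 'e', ' ', 'c', 'a']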
To create a single list of words from the list of sentences, you could, for example, split each sentence and flatten:
tokens = [word for sent in sent_tokenize_list for word in sent.split()]
This doesn't handle punctuation, but neither did your original attempt. Your original would also work with split:
tokens = [w for w in itertools.chain(*(sent.split()
                                       for sent in sent_tokenize_list))]
Note that you can use a generator expression instead of a list comprehension as the argument to be unpacked. Better still, use chain.from_iterable:
tokens = [w for w in itertools.chain.from_iterable(
    sent.split() for sent in sent_tokenize_list)]
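As a quick sanity check (again with a made-up sentence list, not your data), the flattened comprehension and the chain.from_iterable version produce the same flat list of words:

import itertools

sentences = ["the cat sat.", "the dog ran."]
flat = [word for sent in sentences for word in sent.split()]
chained = [w for w in itertools.chain.from_iterable(sent.split() for sent in sentences)]
print(flat)              # ['the', 'cat', 'sat.', 'the', 'dog', 'ran.']
print(flat == chained)   # True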
For punctuation handling, use nltk.tokenize.word_tokenize instead of str.split. It will return words and punctuation as separate items, and it will split, for example, I's into I and 's (which is of course a good thing, since they really are separate words, just contracted).
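For example (a small illustration; word_tokenize needs the NLTK punkt models, installed via nltk.download('punkt')):

from nltk.tokenize import word_tokenize

print(word_tokenize("I'm sure it's over."))
# ['I', "'m", 'sure', 'it', "'s", 'over', '.']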
Maybe you should use word_tokenize instead of sent_tokenize?
from nltk.tokenize import word_tokenize
text = open(path).read().lower().decode("utf8")
tokens = word_tokenize(text)
http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize
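If you are on Python 3, a rough equivalent of the snippet above opens the file with an explicit encoding rather than calling .decode() on the text (assuming path points at a UTF-8 file):

from nltk.tokenize import word_tokenize

with open(path, encoding="utf8") as f:
    text = f.read().lower()

tokens = word_tokenize(text)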
First of all, if the file is in utf8 and you are using Python 2, it is better to use the encoding='utf8' parameter with io.open():
import io
from nltk import word_tokenize, sent_tokenize

with io.open('file.txt', 'r', encoding='utf8') as fin:
    document = []
    for line in fin:
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]
If it is Python 3, just do:
from nltk import word_tokenize, sent_tokenize

with open('file.txt', 'r') as fin:
    document = []
    for line in fin:
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]
Take a look at http://nedbatchelder.com/text/unipain.html
As for the tokenization: if we assume that each line contains some kind of paragraph, possibly made up of one or more sentences, we first want to initialize a list to store the whole document:
document = []
Then we iterate over the lines and split each line into sentences:
for line in fin:
    sentences = sent_tokenize(line)
Then we split the sentences into tokens:
tokens = [word_tokenize(sent) for sent in sentences]
Since we want to update our document list to store the tokenized sentences, we use:
document = []
for line in fin:
    document += [word_tokenize(sent) for sent in sent_tokenize(line)]
Not recommended!!! (but it can still be done in one line):
alvas@ubi:~$ cat file.txt
this is a paragph. with many sentences.
yes, hahaah.. wahahha...
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> list(chain(*[[word_tokenize(sent) for sent in sent_tokenize(line)] for line in io.open('file.txt', 'r', encoding='utf8')]))
[[u'this', u'is', u'a', u'paragph', u'.'], [u'with', u'many', u'sentences', u'.'], [u'yes', u',', u'hahaah..', u'wahahha', u'...']]
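Note that both the loop and the one-liner give you a list of token lists, one per sentence, as shown in the output above. If you want a single flat list of words instead (which is what the question asked for), you can flatten it with the chain.from_iterable approach from earlier, e.g.:

from itertools import chain

# document is a list of token lists, one per sentence
flat_tokens = list(chain.from_iterable(document))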