How to sentence tokenize a list of paragraphs in Python?

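I am currently learning the word2vec technique, but I am stuck on sentence-tokenizing my text data. I hope someone can help me figure out how to do this correctly.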

My data is a collection of complaint records from our customers. When I load the data into a Python list, it looks like this:

text = [
    'this is the first sentence of the first paragraph. and this is the second sentence.',
    'some random text in the second paragraph. and another test sentence.',
    'here is the third paragraph. and this is another sentence',
    'I have run out of text here. I am learning python and deep learning.',
    'another paragraph with some random text. the this is a learning sample.',
    'I need help implementing word2vec. this all sounds exciting.',
    "it's sunday and I shoudnt be learning in the first place. it's nice and sunny here.",
]

I tried some of the most commonly used sentence tokenizer approaches from the community, but they all return this error:

TypeError: expected string or bytes-like object

Eventually, I found this:

import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text[:5][4])  # text[:5][4] is the same element as text[4]
sentences

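This approach works, but I don't know which indices to put in the [][] (for example, 5 and 4) so that the whole dataset (all the paragraphs) is tokenized into sentences.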

Sorry if my question is unclear; please ask if you need any clarification.

Thanks a lot.

You can use nltk.tokenize.word_tokenize() in a list comprehension, which splits each paragraph into word tokens, for example:

In [112]: from nltk.tokenize import word_tokenize
In [113]: tokenized = [word_tokenize(sent) for sent in text]

Output:

[['this',
  'is',
  'the',
  'first',
  'sentence',
  'of',
  'the',
  'first',
  'paragraph',
  '.',
  'and',
  'this',
  'is',
  'the',
  'second',
  'sentence',
  '.'],
 ['some',
  'random',
  'text',
  'in',
  'the',
  'second',
  'paragraph',
  .
  .
  .
  .
  ]]
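
If sentences rather than words are needed, the same list-comprehension pattern should carry over. The TypeError in the question most likely comes from passing the whole list to a tokenizer that expects a single string, so no special index is required: call the splitter once per paragraph. The sketch below is only an illustration under assumptions: naive_sentence_split and split_paragraphs are hypothetical helper names, and the regex splitter is a simple stand-in so the example runs without NLTK or its data files.

```python
import re


def naive_sentence_split(paragraph):
    # Hypothetical stand-in splitter (not NLTK): break after '.', '!' or '?'
    # when followed by whitespace. Real tokenizers handle abbreviations better.
    return [s for s in re.split(r'(?<=[.!?])\s+', paragraph.strip()) if s]


def split_paragraphs(paragraphs, splitter=naive_sentence_split):
    # Call the splitter once per paragraph string, never on the whole list.
    return [splitter(p) for p in paragraphs]


text = [
    'this is the first sentence of the first paragraph. and this is the second sentence.',
    'here is the third paragraph. and this is another sentence',
]

# One list of sentences per paragraph.
nested = split_paragraphs(text)

# All sentences in a single flat list.
flat = [sentence for paragraph in nested for sentence in paragraph]

print(nested)
print(flat)
```

With NLTK and its punkt data installed, nltk.tokenize.sent_tokenize could be passed as the splitter instead, which amounts to [sent_tokenize(p) for p in text]. For word2vec, each sentence in the flat list can then be run through word_tokenize to get one token list per sentence.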