How to tokenize a text corpus?

I want to tokenize a text corpus using the NLTK library.

My corpus looks like this:

['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?",

I tried:

tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]

which raised:

AttributeError: 'str' object has no attribute 'decode'

Any help would be appreciated. Thanks.

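The error says it all: sent has no attribute decode. You only need to call .decode() on items that were encoded first, i.e. bytes objects rather than str objects. In Python 3 your sentences are already str, so remove that call and it should work.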
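
If a corpus might mix bytes and str items, one option is to decode only the bytes ones. The following is a minimal sketch, not part of NLTK: the helper name to_text and the UTF-8 default are assumptions made for illustration.

```python
def to_text(item, encoding="utf-8"):
    """Return item as str, decoding it only when it is a bytes object."""
    # Hypothetical helper: str has no .decode() in Python 3, so only bytes are decoded.
    if isinstance(item, bytes):
        return item.decode(encoding)
    return item


# Sample data: one str sentence and one bytes sentence.
corpus = [
    "What's the best anti diarrheal prescription?",
    b"Which Star Trek character is a member of the magic circle?",
]

# Every element of texts is now a str, whatever it started as.
texts = [to_text(sent) for sent in corpus]
print(texts)
```

The resulting texts list can then be passed item by item to nltk.word_tokenize.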

As this page suggests, the word_tokenize method expects a string as its argument, so just try

tok_corp = [nltk.word_tokenize(sent) for sent in corpus]

Edit: with the following code I can get the tokenized corpus.

Code:

import pandas as pd
from nltk import word_tokenize

corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
 "What's the best anti diarrheal prescription?",
 'What do you call a person who is outside a door and has no arms nor legs?',
 'Which Star Trek character is a member of the magic circle?',
 "What's the difference between a bullet and a human?"]


tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])

Output:

      0     1     2           3        4   ...    13    14    15    16    17
0    Did   you  hear       about      the  ...   tea     ?  None  None  None
1   What    's   the        best     anti  ...  None  None  None  None  None
2   What    do   you        call        a  ...    no  arms   nor  legs     ?
3  Which  Star  Trek   character       is  ...  None  None  None  None  None
4   What    's   the  difference  between  ...  None  None  None  None  None

I think some objects that are neither strings nor bytes-like have crept into your corpus. I suggest you take another look at it.
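
One quick way to take that look is to list the position and type of every item that is neither str nor bytes. This is only a sketch: the function name find_non_text and the sample data are made up for illustration.

```python
def find_non_text(corpus):
    """Return (index, type name) for every item that is neither str nor bytes."""
    # Hypothetical helper for inspecting a corpus before tokenizing it.
    return [
        (i, type(item).__name__)
        for i, item in enumerate(corpus)
        if not isinstance(item, (str, bytes))
    ]


# Made-up sample containing two entries that would need cleaning.
sample = ["What's the best anti diarrheal prescription?", None, 3.5, b"raw bytes"]

problems = find_non_text(sample)
print(problems)  # [(1, 'NoneType'), (2, 'float')]
```

An empty result means every item is text, and the reported indices point at the entries to clean up or drop otherwise.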