How to tokenize a text corpus?
I want to tokenize a text corpus using the NLTK library.
My corpus looks like:
['Did you hear about the Native American man that drank 200 cups of tea?',
"What's the best anti diarrheal prescription?",
'What do you call a person who is outside a door and has no arms nor legs?',
'Which Star Trek character is a member of the magic circle?',
"What's the difference between a bullet and a human?",
I tried:
tok_corp = [nltk.word_tokenize(sent.decode('utf-8')) for sent in corpus]
which raised:
AttributeError: 'str' object has no attribute 'decode'
Any help would be appreciated. Thanks.
The error is right there: sent has no decode attribute. You only need to .decode() them if they were encoded in the first place, i.e. if they are bytes objects rather than str objects. Drop the .decode('utf-8') call and you should be fine.
As this page suggests, the word_tokenize method expects a string as its argument, so just try:
tok_corp = [nltk.word_tokenize(sent) for sent in corpus]
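One caveat not in the original answer: word_tokenize relies on NLTK's Punkt models, so if you hit a LookupError about missing resources, a one-time download may be needed first:

import nltk
nltk.download('punkt')  # fetches the Punkt tokenizer data that word_tokenize uses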
Edit: With the following code I was able to get the tokenized corpus.
Code:
import pandas as pd
from nltk import word_tokenize
corpus = ['Did you hear about the Native American man that drank 200 cups of tea?',
"What's the best anti diarrheal prescription?",
'What do you call a person who is outside a door and has no arms nor legs?',
'Which Star Trek character is a member of the magic circle?',
"What's the difference between a bullet and a human?"]
tok_corp = pd.DataFrame([word_tokenize(sent) for sent in corpus])  # shorter rows are padded with None
Output:
0 1 2 3 4 ... 13 14 15 16 17
0 Did you hear about the ... tea ? None None None
1 What 's the best anti ... None None None None None
2 What do you call a ... no arms nor legs ?
3 Which Star Trek character is ... None None None None None
4 What 's the difference between ... None None None None None
I think some non-string, non-bytes-like object is lurking in your corpus. I suggest you take another look.
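If so, here is a quick sketch (reusing the corpus variable from the code above; this check is mine, not part of the original answer) to surface any offending entries before tokenizing:

# List the index and type of every corpus entry that isn't a plain str
bad = [(i, type(x).__name__) for i, x in enumerate(corpus) if not isinstance(x, str)]
print(bad if bad else 'all entries are str')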