Unhashable type: 'list' error for stopwords
Here is my code.
URL to the CSV file: https://github.com/eugeneketeni/web-mining-final-project/blob/master/Test_file.csv
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/eugeneketeni/web-
mining-final-project/master/Test_file.csv")
import nltk
from nltk import word_tokenize, sent_tokenize
data['text'] = data.loc[:, 'text'].astype(str)
text = data.loc[:, "text"].astype(str)
tokenizer = [word_tokenize(text[i]) for i in range(len(text))]
print(tokenizer)
filtered_sentence = []
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
filtered_sentence = [w for w in tokenizer if not w in stopwords]
print(filtered_sentence)
My tokenizer works, but when I try to remove the default stopwords, I keep getting the "unhashable type: 'list'" error. I am not sure what exactly is happening. I would appreciate any help. Thanks.
TL;DR
from nltk import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
stoplist = set(stopwords.words('english'))
data = pd.read_csv("Test_file.csv")
data['filtered_text'] = data['text'].astype(str).apply(lambda line: [token for token in word_tokenize(line) if token not in stoplist])
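Why the original code fails: tokenizer is a list of token lists, so the comprehension [w for w in tokenizer if not w in stopwords] iterates over whole lists, and testing membership in a set requires hashing w; lists are unhashable, hence the TypeError. A minimal sketch of the failure and the fix (the variable names here are illustrative):

token_lists = [['this', 'is', 'a', 'tweet'], ['another', 'tweet']]
stoplist = {'is', 'a'}
# [w for w in token_lists if w not in stoplist]  # TypeError: unhashable type: 'list'
# Fix: filter the tokens inside each list with a nested comprehension.
filtered = [[w for w in tokens if w not in stoplist] for tokens in token_lists]
print(filtered)  # [['this', 'tweet'], ['another', 'tweet']]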
In Long:
See … for more detailed explanations of:
- tokenizing text in a dataframe
- removing stopwords
- other related cleaning steps (a minimal sketch follows after this list)
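A minimal sketch of such a pipeline, assuming the same Test_file.csv and its text column; the lowercasing and the isalpha() punctuation filter are common choices added here for illustration, not something this answer prescribes:

from nltk import word_tokenize
from nltk.corpus import stopwords
import pandas as pd

stoplist = set(stopwords.words('english'))

def clean(line):
    # Tokenize, lowercase, then drop stopwords and non-alphabetic tokens.
    return [token for token in word_tokenize(line.lower())
            if token.isalpha() and token not in stoplist]

data = pd.read_csv("Test_file.csv")
data['filtered_text'] = data['text'].astype(str).apply(clean)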
For better Twitter text processing:
pip3 install -U nltk[twitter]
Then use this:
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
import pandas as pd
word_tokenize = TweetTokenizer().tokenize
stoplist = set(stopwords.words('english'))
data = pd.read_csv("Test_file.csv")
data['filtered_text'] = data['text'].astype(str).apply(lambda line: [token for token in word_tokenize(line) if token not in stoplist])
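The point of TweetTokenizer is that, unlike the default word_tokenize, it keeps Twitter-specific tokens such as @mentions, #hashtags, and emoticons intact. A small comparison (the sample tweet is made up):

from nltk.tokenize import TweetTokenizer, word_tokenize

sample = "@eugeneketeni loving #NLP :-)"
print(TweetTokenizer().tokenize(sample))
# ['@eugeneketeni', 'loving', '#NLP', ':-)']
print(word_tokenize(sample))
# splits the handle, hashtag, and emoticon into separate pieces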