CountVectorizer fails with bad words
I'm working with a pandas DataFrame and trying to get word-occurrence counts for a particular column that contains strings. The code works fine until it hits a row that raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-af8291199984> in <module>
6
7 cv = CountVectorizer(stop_words=None)
----> 8 cv_fit=cv.fit_transform(texts)
9 word_list = cv.get_feature_names();
10 count_list = cv_fit.toarray().sum(axis=0)
~/anaconda3/envs/turiCreate/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1196 max_features = self.max_features
1197
-> 1198 vocabulary, X = self._count_vocab(raw_documents,
1199 self.fixed_vocabulary_)
1200
~/anaconda3/envs/turiCreate/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
1127 vocabulary = dict(vocabulary)
1128 if not vocabulary:
-> 1129 raise ValueError("empty vocabulary; perhaps the documents only"
1130 " contain stop words")
1131
ValueError: empty vocabulary; perhaps the documents only contain stop words
Here is the code that processes this string:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = [":)"]
cv = CountVectorizer(stop_words=None)
cv_fit = cv.fit_transform(texts)
word_list = cv.get_feature_names()
count_list = cv_fit.toarray().sum(axis=0)
print(word_list)
print(dict(zip(word_list, count_list)))
How can I get CountVectorizer past this problem?
The problem you're running into is the tokenization pattern, which defaults to token_pattern=r'(?u)\b\w\w+\b'. That pattern only matches runs of two or more word characters, so a document like ":)" yields no tokens at all and the vocabulary ends up empty. You can adjust the pattern to suit your task.
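To see why, you can run the default pattern by hand (a quick sketch with the standard re module, not part of the original answer):

import re

default_pattern = r'(?u)\b\w\w+\b'  # CountVectorizer's default token_pattern
print(re.findall(default_pattern, ':)'))        # [] -> no tokens, hence "empty vocabulary"
print(re.findall(default_pattern, 'hello :)'))  # ['hello'] -> the emoticon is silently dropped

Extending the pattern so it also matches emoticon characters fixes the error: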
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["hello :)"]
# the added alternative '[:)]+' matches runs of ':' and ')', e.g. ':)'
cv = CountVectorizer(stop_words=None, token_pattern=r'(?u)\b\w\w+\b|[:)]+')
cv_fit = cv.fit_transform(texts)
word_list = cv.get_feature_names()
count_list = cv_fit.toarray().sum(axis=0)
print(word_list)
print(dict(zip(word_list, count_list)))
[':)', 'hello']
{':)': 1, 'hello': 1}
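An alternative that avoids regex tuning altogether is to supply your own tokenizer. A minimal sketch, assuming whitespace-separated text is good enough for your data (and noting that scikit-learn 1.0+ renamed get_feature_names() to get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["hello :)"]
# str.split tokenizes on whitespace, so ':)' survives as its own token;
# token_pattern=None silences the warning that the pattern goes unused
cv = CountVectorizer(tokenizer=str.split, token_pattern=None)
cv_fit = cv.fit_transform(texts)
word_list = cv.get_feature_names_out()  # use get_feature_names() on older scikit-learn
counts = cv_fit.toarray().sum(axis=0)
print(dict(zip(word_list, (int(c) for c in counts))))
# {':)': 1, 'hello': 1}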
If you care about emoticons and emoji, a more robust ("industrial-strength", as they say) way to achieve your goal might be spacy:
import spacy
from spacymoji import Emoji
from collections import Counter

nlp = spacy.load('en_core_web_sm')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)  # spaCy 2.x API: pass the component instance

# nlp.tokenizer only splits the text; it keeps ':)' as a single token
tokens = [tok for tok in nlp.tokenizer("Hi :) ")]
counts = Counter(tokens)
print(counts)
Counter({Hi: 1, :): 1, : 1})
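Note that the empty-looking third entry in the Counter is the trailing whitespace token from "Hi :) ". The snippet above uses the spaCy 2.x pipeline API, where a component instance is passed to add_pipe. In spaCy 3.x components are added by their registered name instead; a minimal sketch, assuming a spacymoji release that supports spaCy 3:

import spacy
from spacymoji import Emoji  # importing registers the 'emoji' factory with spaCy 3.x

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('emoji', first=True)  # add by registered name in spaCy 3.x

doc = nlp('Hi :) 😀')
# spacymoji flags Unicode emoji; ASCII emoticons like ':)' remain plain tokens
print([(tok.text, tok._.is_emoji) for tok in doc])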