get_feature_names not found in CountVectorizer()
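I am mining a Stack Overflow data dump of posts about deep learning libraries. I want to identify stop words that are specific to my corpus (for example, 'python'). To do that, I want to get my feature names so that I can find the terms with the highest frequency.
I create the documents and the corpus as follows: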
with open("Whosebug_2018_Data.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    pytorch_doc = ''
    tensorflow_doc = ''
    cotag_list = []
    keras_doc = ''
    counte = 0
    for row in csv_reader:
        if row[2] == 'tensorflow':
            tensorflow_doc += row[3] + ' '
        if row[2] == 'keras':
            keras_doc += row[3] + ' '
        if row[2] == 'pytorch':
            pytorch_doc += row[3] + ' '
corpus = [pytorch_doc, tensorflow_doc, keras_doc]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)
print(x)
x.toarray()
Dict = []
feat = x.get_feature_names()
for i,arr in enumerate(x):
    for x, ele in enumerate(arr):
        if i == 0:
            Dict += ('pytorch', feat[x], ele)
        if i == 1:
            Dict += ('tensorflow', feat[x], ele)
        if i == 2:
            Dict += ('keras', feat[x], ele)
sorted_arr = sorted(Dict, key=lambda tup: tup[2])
However, I get:
  File "sklearn_stopwords.py", line 83, in <module>
    main()
  File "sklearn_stopwords.py", line 50, in main
    feat = x.get_feature_names()
  File "/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found
get_feature_names is a method of the CountVectorizer object. You are trying to call get_feature_names on the result of fit_transform, which is a scipy.sparse matrix and has no such method.
You need to use vectorizer.get_feature_names() instead.
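To make the distinction concrete, here is a minimal sketch (not the asker's script) that checks both objects. The two-document corpus and the feature_names helper are made up for illustration. The helper also allows for newer scikit-learn releases, where get_feature_names() was deprecated in 1.0 and removed in 1.2 in favor of get_feature_names_out().

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return the fitted vocabulary as a plain list of strings.

    Hypothetical helper: prefers get_feature_names_out() (scikit-learn >= 1.0)
    and falls back to get_feature_names() on older releases.
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


vectorizer = CountVectorizer()
x = vectorizer.fit_transform(["keras model fit", "pytorch tensor model"])

# fit_transform returns a scipy.sparse matrix that holds only the counts.
print(issparse(x))
# The matrix has no get_feature_names attribute, hence the AttributeError.
print(hasattr(x, "get_feature_names"))

# The vocabulary is stored on the vectorizer, in column order of x.
names = feature_names(vectorizer)
print(names)
```

The position of each name in the list matches the column index of the matrix, so names[j] labels column j of x.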
Try this MVCE:
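from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']
X = vectorizer.fit_transform(corpus)
# Call the method on the vectorizer, not on X.
# On scikit-learn >= 1.0 use vectorizer.get_feature_names_out() instead;
# get_feature_names() was removed in 1.2.
features = vectorizer.get_feature_names()
features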
Output:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
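Applied to the original goal of finding the most frequent terms per library, the same idea might look like the sketch below. The three short strings are made-up stand-ins for pytorch_doc, tensorflow_doc and keras_doc, and the names labels, counts, sorted_counts and in_all_docs are illustrative only. It also sidesteps two further problems that the question's loop would likely hit once the AttributeError is fixed: the inner loop variable x shadows the matrix, and Dict += (a, b, c) extends the list with three separate items instead of appending one tuple.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the three concatenated documents in the question.
labels = ["pytorch", "tensorflow", "keras"]
corpus = [
    "python tensor python autograd",
    "python graph session python python",
    "python layer model",
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

# Feature names come from the vectorizer; the method was renamed in
# scikit-learn 1.0, so support both spellings.
if hasattr(vectorizer, "get_feature_names_out"):
    feat = list(vectorizer.get_feature_names_out())
else:
    feat = list(vectorizer.get_feature_names())

# Convert to a dense array once, then build one (label, term, count)
# tuple per nonzero cell. append() keeps each tuple whole.
dense = x.toarray()
counts = []
for label, row in zip(labels, dense):
    for term, count in zip(feat, row):
        if count > 0:
            counts.append((label, term, int(count)))

# Highest counts first.
sorted_counts = sorted(counts, key=lambda tup: tup[2], reverse=True)
print(sorted_counts[:2])

# Terms that occur in every document are candidates for corpus-specific
# stop words.
in_all_docs = [t for t, present in zip(feat, (dense > 0).all(axis=0)) if present]
print(in_all_docs)
```

Calling toarray() is fine for three documents; for a large corpus it may be better to keep the matrix sparse and sum columns with x.sum(axis=0) instead.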