The names of the columns in a CountVectorizer sparse matrix in Python
When I use the code below:
from sklearn.feature_extraction.text import CountVectorizer
X = dataset.Tweet
y = dataset.Type
count_vect = CountVectorizer()
BoW = count_vect.fit_transform(X)
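It returns the document-term counts as a sparse matrix.
I have figured out how to get the sparse matrix's data, indices, and indptr.
My question is: how do I get the names of the columns (which should be the features, i.e. the words)?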
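What you want is vectorizer.get_feature_names(), called on the fitted vectorizer (count_vect in your code). Note that scikit-learn 1.0 deprecated this method in favor of get_feature_names_out(), and version 1.2 removed it, so on recent versions call get_feature_names_out() instead. Here is the example from the documentation: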
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
# [0 2 0 1 0 1 1 0 1]
# [1 0 0 1 1 0 1 1 1]
# [0 1 1 1 0 0 1 0 1]]
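To see why position j in the feature-name list labels column j of the matrix, here is a minimal standard-library sketch that rebuilds the same vocabulary and counts by hand. It is not scikit-learn's implementation; it only assumes the default settings (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically), and the names tokenize, feature_names, column_of, and rows are made up for illustration.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumption: approximates CountVectorizer's default tokenization
# (lowercase, keep tokens of at least two word characters).
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split one document into lowercase word tokens."""
    return TOKEN.findall(text.lower())


docs = [tokenize(doc) for doc in corpus]

# Sorting the vocabulary fixes the column order: feature_names[j] labels column j.
feature_names = sorted({token for doc in docs for token in doc})
column_of = {name: j for j, name in enumerate(feature_names)}

# One row per document, one count per vocabulary entry.
rows = []
for doc in docs:
    counts = Counter(doc)
    rows.append([counts[name] for name in feature_names])

print(feature_names)
for row in rows:
    print(row)
print(column_of['document'])  # column index holding the counts for 'document'
```

With the sparse matrix from the question, the same idea applies: each entry in the matrix's indices array is a column number, and looking that number up in the feature-name list gives the word.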