From features to words in Python ("reverse" bag of words)
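I used sklearn to create a BOW with 200 features in Python, and the features are easy to extract. But how do I reverse it? That is, how do I go from a vector of 200 0's or 1's to the corresponding words? Since the vocabulary is a dictionary and therefore unordered, I'm not sure which word each element of the feature list corresponds to. Also, if the first element of my 200-dimensional vector corresponds to the first word in the dictionary, how do I extract a word from the dictionary by index?
The BOW was built like this: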
vec = CountVectorizer(stop_words = sw, strip_accents="unicode", analyzer = "word", max_features = 200)
features = vec.fit_transform(data.loc[:,"description"]).todense()
So "features" is an (n, 200) matrix, where n is the number of sentences.
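I'm not entirely sure what you're going for, but it seems you just want to figure out which column represents which word. For that, there is the handy get_feature_names method (renamed get_feature_names_out in scikit-learn 1.0; the old name was removed in 1.2).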
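On the worry in the question that the vocabulary is an unordered dictionary: the fitted vectorizer's vocabulary_ attribute maps each word to its column index, so inverting it gives an index-to-word lookup. A minimal sketch, assuming a made-up two-sentence corpus (docs, index_to_word and words_in_column_order are names chosen here for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not the asker's dataset.
docs = ["the cat sat", "the dog sat down"]

vec = CountVectorizer()
vec.fit(docs)

# vocabulary_ maps word -> column index; flip it to get column index -> word.
index_to_word = {int(idx): word for word, idx in vec.vocabulary_.items()}

# Walk the columns in order and look up the word behind each one.
words_in_column_order = [index_to_word[i] for i in range(len(index_to_word))]
print(words_in_column_order)  # ['cat', 'dog', 'down', 'sat', 'the']
```

The dictionary itself has no meaningful order, but the indices stored in it do, and they are the column positions of the feature matrix.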
Let's take a look at the example corpus provided in the docs:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Put into a dataframe
data = pd.DataFrame(corpus, columns=['description'])
# Take a look:
>>> data
                             description
0            This is the first document.
1  This document is the second document.
2             And this is the third one.
3            Is this the first document?
# Initialize CountVectorizer (you can put in your arguments, but for the sake of example, I'm keeping it simple):
vec = CountVectorizer()
# Fit it as you had before:
features = vec.fit_transform(data.loc[:,"description"]).todense()
>>> features
matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
To see which column represents which word, use get_feature_names:
>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
So your first column is and, the second is document, and so on. For readability, you can put this in a dataframe:
>>> pd.DataFrame(features, columns = vec.get_feature_names())
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
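To go all the way back from a feature vector to its words, which is the "reverse" step the question asks about, one option is CountVectorizer's inverse_transform; another is to index the vocabulary by hand with the nonzero positions of a row. A minimal sketch on the same docs corpus, assuming default CountVectorizer settings (words_per_row, index_to_word and words_row_2 are illustrative names, not part of sklearn):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vec = CountVectorizer()
features = vec.fit_transform(corpus)  # sparse matrix, shape (4, 9)

# Option 1: let the vectorizer map every row back to the words it contains.
# Sorted here so the result does not depend on the order sklearn returns.
words_per_row = [sorted(str(w) for w in row)
                 for row in vec.inverse_transform(features)]

# Option 2: do it by hand for one row, using the nonzero column positions.
index_to_word = {int(i): w for w, i in vec.vocabulary_.items()}
row = features[2].toarray().ravel()  # 1-D count vector for the third sentence
words_row_2 = [index_to_word[int(i)] for i in np.flatnonzero(row)]

print(words_per_row[0])  # ['document', 'first', 'is', 'the', 'this']
print(words_row_2)       # ['and', 'is', 'one', 'the', 'third', 'this']
```

Either way you only recover which words are present. The original word order is gone, and with 0/1 vectors so are the counts, because a bag of words does not store them.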