How do I get the sequence of vocabulary from a sparse matrix

Question

I have a vocabulary list ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph'] and a list of sentences ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"]. I use 'CountVectorizer' from 'sklearn' to fit the sentences to a sparse matrix based on these eight words. I get the output below.

[[0 0 0 0 0 1 0 1]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 1]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]

Now I am trying to find out the order of the eight words in the matrix. Any help would be appreciated.

Answer 1

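CountVectorizer lowercases text by default, so 'Human', 'Graph', and 'ESP' have no matches. ('ESP' would not match in any case, because the corpus spells it 'EPS'.) In your result, the vocabulary vector also appears to have been sorted in some way.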
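To make the effect of the default concrete, here is a minimal sketch on a shortened, two-sentence corpus (the names `docs`, `default_counts`, and `cased_counts` are illustrative and not from the question). It compares the counts with and without lowercasing; newer scikit-learn versions may also emit a warning about uppercase vocabulary entries in the default case.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative corpus and vocabulary (not the full data from the question).
docs = ["Human machine interface", "Graph minors A survey"]
voc = ["Human", "machine", "Graph", "minors"]

# Default lowercase=True: tokens become "human", "graph", ... and therefore
# no longer equal the capitalized vocabulary entries.
default_counts = CountVectorizer(vocabulary=voc).fit_transform(docs).toarray()

# lowercase=False keeps the original case, so the capitalized entries match.
cased_counts = CountVectorizer(vocabulary=voc, lowercase=False).fit_transform(docs).toarray()

print(default_counts)  # columns for 'Human' and 'Graph' stay at zero
print(cased_counts)    # all four vocabulary words are counted
```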

You can set lowercase=False.

lowercase : boolean, True by default. Convert all characters to lowercase before tokenizing. (scikit-learn documentation)

Like this:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

voc = ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']

vectorizer = CountVectorizer(vocabulary=voc, lowercase=False)

X = vectorizer.fit_transform(corpus)

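# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
print(vectorizer.get_feature_names_out().tolist())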
print(X.toarray())


#     ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
#     [[1 1 1 0 0 0 0 0]
#      [0 0 0 0 0 0 1 0]
#      [0 1 0 0 0 0 1 0]
#      [0 0 0 0 0 0 0 0]
#      [0 0 0 1 0 0 0 0]
#      [0 0 0 0 0 0 0 0]
#      [0 0 0 0 1 0 0 1]
#      [0 0 0 0 1 0 0 1]]

In the matrix, each row holds the vocabulary matches for one sentence. So in this case 'Human', 'interface', and 'machine' match the first row (sentence).
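The column order itself can also be read back from the fitted vectorizer. The following is a sketch under stated assumptions rather than the only way to do it: it uses the `vocabulary_` attribute, which maps each term to its column index, and a shortened two-sentence corpus. The names `column_of`, `ordered_terms`, and `matches` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences from the question, enough to illustrate the idea.
corpus = [
    "Human machine interface for lab abc computer applications",
    "Graph minors A survey",
]
voc = ["Human", "interface", "machine", "binary", "minors", "ESP", "system", "Graph"]

vectorizer = CountVectorizer(vocabulary=voc, lowercase=False)
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in the matrix.
column_of = vectorizer.vocabulary_

# Sorting the terms by column index recovers the column order.
ordered_terms = sorted(column_of, key=column_of.get)
print(ordered_terms)

# For each row (sentence), list the vocabulary words with a non-zero count.
matches = [[ordered_terms[j] for j in np.flatnonzero(row)] for row in X.toarray()]
print(matches)
```

When the vocabulary is passed in as a list, the columns follow the order of that list, so `ordered_terms` comes out equal to `voc` here.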


python

scikit-learn

text-classification