如何使用 sklearn 的 CountVectorizer 进行矢量化和去矢量化?

How to vectorize and devectorize using sklearn's CountVectorizer?

我想将一些文本向量化为相应的整数,然后将这些文本转换为其映射的整数,并使用新的输入整数创建新句子 [2,9,39,46,56,12,89,9]

我看到了一些可以用于此目的的自定义函数,但我想知道 sklearn 本身是否有这样的函数。

from sklearn.feature_extraction.text import CountVectorizer

a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim. 
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque. 
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus. 
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula 
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo 
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae 
hendrerit vel, vulputate quis metus."""]


vec = CountVectorizer()
dtm=vec.fit_transform(a)
print vec.vocabulary_

#convert text to corresponding vectors
mapped_a=

#new sentence using below mapped values
#input [2,9,39,46,56,12,89,9]
#creating sentence using specific sequence

new_sentence=

看看sklearn中的预处理库,LabelEncoder和OneHotEncoder通常用于编码分类变量。但不建议对整个文本进行编码!

要将句子向量化为整数,您可以使用 transform 函数。此函数的输出是包含每个术语计数的向量 - 特征向量。

vec = CountVectorizer()
vec.fit(a)
print vec.vocabulary_

new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print mapped_a.toarray() # sparse feature vector

tokenizer = vec.build_tokenizer()
# array of words ids
for token in tokenizer(new_sentence):
    print vec.vocabulary_.get(token)

问题的第二部分不是那么简单。 CountVectorizer 具有用于此目的的 inverse_transform 函数,将特征的稀疏向量作为输入。但是,在您的示例中,您希望创建一个句子,其中可能会出现相同的术语,而使用该功能是不可能的。

然而,解决方案是使用词汇表(word to id)并基于它构建逆向词汇表(id to word)。 CountVectorizer 默认没有 inverse_vocabulary 必须根据 vocabulary.

创建
input = [2,9,9]

# 1. inverse_transform function
# create sparse vector
sparse_input = [1 if i in input else 0 for i in range(0, len(vec.vocabulary_))]
print vec.inverse_transform(sparse_input)
> ['aliquam', 'commodo']


# 2. Inverse vocabulary - custom solution
terms = np.array(list(vec.vocabulary_.keys()))
indices = np.array(list(vec.vocabulary_.values()))
inverse_vocabulary = terms[np.argsort(indices)]

for i in input:
    print inverse_vocabulary[i]
> ['aliquam', 'commodo', 'commodo']