Is it possible to create an equivalent "restrict" method for CountVectorizer as is available for DictVectorizer in Scikit-learn?

For DictVectorizer, the object can be subset using the restrict() method. Here is an example where I explicitly list the features to keep using a boolean array.

from sklearn.feature_extraction import DictVectorizer
import numpy as np
v = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)
v.get_feature_names()
>>['bar', 'baz', 'foo']
user_list = np.array([False, False, True], dtype=bool)
v.restrict(user_list)
v.get_feature_names()
>>['foo']

I would like the same capability for a non-normalized CountVectorizer object. I have not found any way to slice the numpy object that comes out of CountVectorizer, since there are many dependent attributes. The reason I am interested is that this would remove the need to repeatedly fit and transform the text data in scenarios where features are simply removed post hoc after the first fit and transform. Is there an equivalent method I am missing, or can a custom method be created easily for CountVectorizer?
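One way to get restrict()-like behavior is a small helper that rebuilds vocabulary_ on a copy of the fitted vectorizer. This is only a sketch, not a scikit-learn API; restrict_vocabulary is a hypothetical name, and it recovers the feature names in column order from vocabulary_ itself:

```python
# Sketch of a DictVectorizer.restrict()-style helper for CountVectorizer.
# restrict_vocabulary is a hypothetical name, not part of scikit-learn.
import copy

from sklearn.feature_extraction.text import CountVectorizer


def restrict_vocabulary(vectorizer, support):
    """Return a copy of a fitted CountVectorizer keeping only the
    features whose position in `support` (a boolean sequence) is True."""
    # Feature names in column order, recovered from the fitted vocabulary.
    names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    kept = [name for name, keep in zip(names, support) if keep]
    restricted = copy.deepcopy(vectorizer)
    # Reassign contiguous column indices so transform() output stays dense.
    restricted.vocabulary_ = {term: i for i, term in enumerate(kept)}
    return restricted
```

Calling transform() on the returned copy then yields one column per retained feature, without refitting the original corpus.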

Update based on @Vivek's reply

This approach appears to work. Here is the code I ran directly in a Python session.

from sklearn.feature_extraction.text import CountVectorizer
import copy as cp
import numpy as np
v = CountVectorizer()
D = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']
v.fit_transform(D)
print(v.get_feature_names())
print(len(v.get_feature_names()))
>> ['about', 'amazing', 'data', 'is', 'modeling', 'of', 'part', 'predictive', 'science', 'the']
>> 10
user_list = np.array([False, False, True, False, False, True, False, False, True, False], dtype=bool)

new_vocab = {}
for i in np.where(user_list)[0]:
    print(v.get_feature_names()[i])
    new_vocab[v.get_feature_names()[i]] = len(new_vocab)
new_vocab

>> data
>> of
>> science
>> {'data': 0, 'of': 1, 'science': 2}

v_copy = cp.deepcopy(v)
v_copy.vocabulary_ = new_vocab
print(v_copy.vocabulary_)
print(v_copy.get_feature_names())
v_copy

>> {'data': 0, 'of': 1, 'science': 2}
>> ['data', 'of', 'science']
>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\b\w\w+\b',
                tokenizer=None, vocabulary=None)

v_copy.transform(D).toarray()
>> array([[2, 0, 1],
       [0, 0, 1],
       [1, 1, 1]], dtype=int64)

Thanks @Vivek! This appears to work as expected for a non-normalized CountVectorizer object.
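If the goal is only to avoid a second transform pass, the already-computed document-term matrix can also be column-sliced with the same boolean mask, since SciPy sparse matrices support integer column indexing. A minimal sketch rebuilding the same D and mask as above:

```python
# Column-slicing the existing document-term matrix with the boolean mask
# avoids re-running transform entirely (same D and mask as above).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
D = ['Data science is about the data', 'The science is amazing',
     'Predictive modeling is part of data science']
X = v.fit_transform(D)                        # shape (3, 10)

user_list = np.array([False, False, True, False, False,
                      True, False, False, True, False])
X_restricted = X[:, np.where(user_list)[0]]   # keep 'data', 'of', 'science'
```

X_restricted matches the array produced by v_copy.transform(D) above, but is obtained without re-tokenizing the corpus.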

Answer implementing @Vivek's suggestion, posted as a comment on the original question; the code is the same as in the update above.

You can assign, or restrict, the vocabulary of one vectorizer to another as follows:

from sklearn.feature_extraction.text import CountVectorizer

count_vect1 = CountVectorizer()
count_vect1.fit(list_of_strings1)

count_vect2 = CountVectorizer(vocabulary=count_vect1.vocabulary_)
count_vect2.fit(list_of_strings2)
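As a quick sanity check (with made-up toy corpora), the second vectorizer's columns line up with the first one's vocabulary, and terms absent from that vocabulary are simply dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs_a = ['data science is fun', 'science rocks']   # toy corpora (assumed)
docs_b = ['fun data', 'unrelated words vanish']

cv1 = CountVectorizer()
cv1.fit(docs_a)              # learns: data, fun, is, rocks, science

# cv2 reuses cv1's learned vocabulary, so fit() keeps it fixed and
# transform() columns align with cv1's feature order.
cv2 = CountVectorizer(vocabulary=cv1.vocabulary_)
X2 = cv2.fit_transform(docs_b)
```

Terms in docs_b that are not in cv1's vocabulary contribute nothing, which is why the second row of X2 is all zeros.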

Answer adapted from: ValueError: Dimension mismatch