With TfidfVectorizer, is it possible to use one corpus for idf information, and another one for the actual index?


I want to train a classifier on bag-of-words tf-idf data.


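I plan to build the classifier from a labeled corpus, using a bag-of-words representation with tf-idf weighting. However, I would prefer to compute the idf statistics from the complete corpus, including the unlabeled data.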

Is this possible with sklearn?

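One solution I thought of is to build the model over the whole corpus and then drop the rows that belong to the unlabeled data. However, the corpus may be too large to fit in RAM.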

If I understand correctly, you can fit the TF-IDF model on all of the data and then call transform on the smaller tagged corpus:

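from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(alldata)  # learn the vocabulary and idf weights from the full corpus (tagged + untagged)
tagged_data_tfidf = vec.transform(tagged_data)  # weight only the tagged documents with those idf values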
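
To make the effect concrete, here is a minimal sketch on toy data. The document lists (`labeled_docs`, `unlabeled_docs`) are made up for illustration and are not part of the question. It fits one vectorizer on the full corpus and one on the labeled documents only, then compares the idf learned for the same term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: the labeled documents are a subset of the full corpus.
labeled_docs = ["the cat sat", "the dog barked"]
unlabeled_docs = ["the bird sang", "the fish swam", "a cat purred"]
all_docs = labeled_docs + unlabeled_docs

# Fit on everything, transform only the labeled part.
vec = TfidfVectorizer()
vec.fit(all_docs)
labeled_tfidf = vec.transform(labeled_docs)

# For comparison: a vectorizer that only ever saw the labeled documents.
labeled_only = TfidfVectorizer().fit(labeled_docs)

# The idf of the same term depends on which corpus the vectorizer was fitted on.
idf_full = vec.idf_[vec.vocabulary_["cat"]]
idf_small = labeled_only.idf_[labeled_only.vocabulary_["cat"]]

print(labeled_tfidf.shape)   # one row per labeled document
print(idf_full, idf_small)   # two different idf values for "cat"
```

The resulting matrix has one row per labeled document, but its columns and idf weights come from the full corpus, so no rows need to be removed afterwards.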


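As for data that does not fit in RAM, you can feed the vectorizer from an iterator, or from several iterators if the data is spread across different sources. In my case, the tagged data is stored in files and the rest of my data is stored in MongoDB.

File iterator: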

import os


class File2Doc(object):
    """Yields the text of every .txt file found under top_dir, one document at a time."""

    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        for root, dirs, files in os.walk(self.top_dir):
            for fname in files:
                if fname.endswith('.txt'):
                    with open(os.path.join(root, fname), encoding='utf8', errors='ignore') as file:
                        yield file.read()


class Mongo2Doc(object):
    """
    An iterator that builds a find pymongo cursor and saves the text field in the mongodb collection
    """

    def __init__(self, query):
        # query is expected to expose a pymongo cursor (.cur) and the name of the text field (.text_field)
        self.cur = query.cur
        self.text_field = query.text_field

    def __iter__(self):
        for document in self.cur:
            yield document[self.text_field]


from itertools import chain


class MyDocIterator(object):
    """
    Expects a list of [folders] (paths) and/or a list of mongoDB [queries]
    mongoDB queries have the form (collection_name, {find_query}, {projection: or text_field})
    mongo_query = [mongo_client.db.collection, {'optional_query': 'some_value'}, {'text':1}]
    """

    def __init__(self, folders=None, mongo_query=None):
        self.folders = folders
        self.mongo_query = mongo_query
        if self.folders is not None:
            assert isinstance(self.folders, list), 'folders should be a list'
        if self.mongo_query is not None:
            assert isinstance(self.mongo_query,
                              list), 'Mongo query should be a list'
        if self.folders is None and self.mongo_query is None:
            raise TypeError(
                'Please specify at least one folder or one mongo query')

    def __iter__(self):
        # Collect one iterable per source, then chain them into a single stream of documents
        k = []
        if self.folders is not None:
            k.extend(File2Doc(folder) for folder in self.folders)
        if self.mongo_query is not None:
            k.extend(Mongo2Doc(query) for query in self.mongo_query)
        return chain.from_iterable(k)


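from sklearn.feature_extraction.text import CountVectorizer

my_docs = MyDocIterator(['path_to_data'])
bow_vectorizer = CountVectorizer(preprocessor=custom_text_preprocessor, tokenizer=str.split)
bow_vectorizer.fit(my_docs)  # presumably the next step: documents are read from the iterator one at a time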

The same works with TfidfVectorizer.
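
As a rough end-to-end sketch of the same idea with TfidfVectorizer: the snippet below streams documents from two sources and fits on their concatenation, then transforms only the file-based (labeled) part. The helper `iter_txt_files`, the temporary files, and the in-memory generator standing in for a database cursor are all assumptions made for illustration, not the setup from the answer above.

```python
import os
import tempfile
from itertools import chain

from sklearn.feature_extraction.text import TfidfVectorizer


def iter_txt_files(top_dir):
    """Yield the contents of each .txt file under top_dir, one at a time."""
    for root, _dirs, files in os.walk(top_dir):
        for fname in sorted(files):
            if fname.endswith(".txt"):
                with open(os.path.join(root, fname), encoding="utf8") as fh:
                    yield fh.read()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical labeled documents stored as files on disk.
    for i, text in enumerate(["good movie", "bad movie"]):
        with open(os.path.join(tmp, f"doc{i}.txt"), "w", encoding="utf8") as fh:
            fh.write(text)

    # Stand-in for a second source such as a database cursor:
    # any iterable of strings works here.
    unlabeled_source = (text for text in ["great film", "boring film", "movie night"])

    vec = TfidfVectorizer()
    # idf statistics come from both sources, read lazily one document at a time.
    vec.fit(chain(iter_txt_files(tmp), unlabeled_source))
    # Only the labeled files are turned into tf-idf rows.
    labeled_tfidf = vec.transform(iter_txt_files(tmp))

print(labeled_tfidf.shape)
```

One caveat: streaming keeps the raw text out of memory, but fit still builds a sparse term-count matrix for the whole corpus internally, so a very large corpus can still be memory-hungry.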