Sklearn TFIDF on large corpus of documents

Question

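For an internship project I have to perform tf-idf analysis on a large number of files (~18000). I am trying to use the TFIDF vectorizer from sklearn, but I am facing the following issue: how can I avoid loading all the files into memory at once? From what I have read in other posts, using an iterable seems feasible, but if I pass, for instance, [open(file) for file in os.listdir(path)] as the raw_documents input to the fit_transform() function, I get a 'too many open files' error. Thanks in advance for your suggestions! Cheers! Paul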

Answer 1

Have you tried the input='filename' parameter of TfidfVectorizer? Something like this:

from sklearn.feature_extraction.text import TfidfVectorizer

# List containing the file paths of all the files
raw_docs_filepaths = [...]

tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)

This should work because, in this case, the vectorizer opens one file at a time as it processes them. This can be confirmed by cross-checking the source code:

def decode(self, doc):
    ...
    ...
    if self.input == 'filename':
        with open(doc, 'rb') as fh:
            doc = fh.read()
    ...
    ...
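
To make the approach above concrete, here is a minimal sketch of how the list of file paths might be built, together with a generator that reads one file at a time. The helper names list_file_paths, iter_documents and fit_tfidf are hypothetical and do not come from scikit-learn. The sketch also assumes that all documents sit directly inside a single directory and are UTF-8 encoded. Note that os.listdir() returns bare file names, so they have to be joined with the directory before being opened.

```python
import os


def list_file_paths(directory):
    """Return the sorted full paths of the regular files directly inside `directory`."""
    # os.listdir() yields bare names, so join each one with the directory.
    paths = (os.path.join(directory, name) for name in os.listdir(directory))
    return sorted(p for p in paths if os.path.isfile(p))


def iter_documents(paths, encoding="utf-8"):
    """Yield the text of each file, keeping only one file open at a time."""
    for path in paths:
        with open(path, encoding=encoding) as fh:
            yield fh.read()


def fit_tfidf(directory):
    """Fit TF-IDF over every file in `directory` (hypothetical helper, needs scikit-learn)."""
    from sklearn.feature_extraction.text import TfidfVectorizer

    # With input='filename' the vectorizer opens and reads each path itself.
    vectorizer = TfidfVectorizer(input="filename")
    matrix = vectorizer.fit_transform(list_file_paths(directory))
    return vectorizer, matrix
```

As an alternative to input='filename', the generator can be passed to a default vectorizer, e.g. TfidfVectorizer().fit_transform(iter_documents(paths)). Either way only one file handle is open at any moment, which should avoid the 'too many open files' error. The resulting sparse matrix and the vocabulary still have to fit in memory.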

python

scikit-learn

tfidfvectorizer