tf-idf for a large number of documents (>100k)

Question

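I am computing tf-idf for a very large corpus (about 100k documents) and it gives me a memory error. Is there an implementation that handles this many documents well? I would also like to use my own stop word list. The same code works for 50k documents, so if I stay with the sklearn implementation, what is the limit on the number of documents I can use in this computation?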

  def tf_idf(self, df):
    df_clean, corpus = self.CleanText(df)
    tfidf=TfidfVectorizer().fit(corpus)
    count_tokens=tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    tf_idf_DF=pd.DataFrame(data=article_vect.toarray(),columns=count_tokens)
    tf_idf_DF = pd.DataFrame(tf_idf_DF.sum(axis=0).sort_values(ascending=False))

    return tf_idf_DF

Error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64

Thanks in advance.

Answer 1

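TfidfVectorizer has many parameters (see the TfidfVectorizer documentation). You should set max_df=0.9, min_df=0.1 and max_features=500, and grid search these parameters to find the best settings.

If you do not set these parameters, you get a huge sparse matrix of shape (96671, 90622), which leads to the memory error.

Welcome to NLP.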
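
A minimal sketch of what setting these parameters could look like, together with the custom stop word list the question asks about. The toy corpus and the stop word list are made up for illustration; the parameter values are the ones suggested above and would normally be tuned for the real data.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real 100k documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the mat and the log are outside",
]

# Hypothetical custom stop word list; replace with your own.
custom_stop_words = ["the", "on", "and", "are"]

vectorizer = TfidfVectorizer(
    stop_words=custom_stop_words,  # a plain list of words is accepted
    max_df=0.9,        # drop terms found in more than 90% of documents
    min_df=0.1,        # drop terms found in fewer than 10% of documents
    max_features=500,  # keep at most the 500 most frequent remaining terms
)

# fit_transform returns a scipy sparse matrix, not a dense array.
X = vectorizer.fit_transform(corpus)

print(X.shape)             # (number of documents, vocabulary size)
print(sparse.issparse(X))  # the result stays sparse
```

These settings shrink the number of columns, so whatever is built from the matrix afterwards is smaller as well. They also discard vocabulary, which may or may not be acceptable for the analysis.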

Answer 2

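As @NickODell said, the memory error only occurs when you convert the sparse matrix to a dense one. The solution is to do everything you need using only the sparse matrix.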

  def tf_idf(self, df):
    
    df_clean, corpus = self.CleanText(df)
    tfidf=TfidfVectorizer().fit(corpus)
    count_tokens=tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    #The following line is the solution:
    tf_idf_DF=pd.DataFrame(data=article_vect.tocsr().sum(axis=0),columns=count_tokens)
    tf_idf_DF = tf_idf_DF.T.sort_values(ascending=False, by=[0])

    tf_idf_DF['word'] = tf_idf_DF.index
    tf_idf_DF['tf-idf'] = tf_idf_DF[0]
    tf_idf_DF = tf_idf_DF.reset_index().drop(['index', 0],axis=1)
    
    return tf_idf_DF

That is the solution.

