sklearn TfidfVectorizer: how to intersect a tfidf frame on a column?

Question

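In R, I can extract the rows (documents) that contain a specific term, say 'toyota', by intersecting the document-term matrix (dtm) with the desired column name, as follows: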

dtm <- DocumentTermMatrix(mycorpus, control = list(tokenize = TrigramTokenizer))
x.df<-as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota"),drop=FALSE])

The problem is that I could not find an equivalent method in Python's sklearn package, so I took a roundabout route:

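First, I get the index values of the rows in the tfidf frame where the relevant column ("toyota") is not null; the column names are the feature names.
Then I slice the main pandas DataFrame by the identified row indices.
Now I have a DataFrame in which every row contains "toyota".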

Here is the MVP:

rows_to_keep=tfidf_df[tfidf_df.toyota.notnull()].index
data=my_df.loc[rows_to_keep,:]
print(data.shape)

This works. The question is: how do I pass an iterator to this statement?

car_make=['toyota','ford','nissan','gmotor','honda','suzuki']

and then

for zentity in car_make:
    rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index

does not work:

AttributeError: 'SparseDataFrame' object has no attribute 'zentity'

I deliberately chose zentity to avoid a clash with any column name in the tfidf frame.

Is there a clean way to create the intersection and extract only the rows where the column is not null (NaN)? Any help would be appreciated.

Answer 1

Instead of rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index

you should use something like rows_to_keep=tfidf_df[tfidf_df[zentity].notnull()].index
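
As a rough sketch of how the bracket form fits into the loop from the question: the frames below are hypothetical hand-built stand-ins in which NaN marks an absent term, and the names `present`, `rows_by_make`, and `any_make` are introduced only for illustration. Filtering `car_make` against the columns first plays the role of R's `intersect(colnames(dtm), ...)` and avoids a `KeyError` for makes such as 'gmotor' that are not in the vocabulary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's frames; NaN marks an absent term.
my_df = pd.DataFrame({"text": ["toyota corolla", "ford focus", "toyota and honda"]})
tfidf_df = pd.DataFrame({
    "toyota": [0.6, np.nan, 0.4],
    "ford": [np.nan, 0.7, np.nan],
    "honda": [np.nan, np.nan, 0.5],
})

car_make = ['toyota', 'ford', 'nissan', 'gmotor', 'honda', 'suzuki']

# Keep only the makes that exist as columns (the "intersection").
present = [make for make in car_make if make in tfidf_df.columns]

# One filtered frame per make, using bracket access with the loop variable.
rows_by_make = {}
for zentity in present:
    rows_to_keep = tfidf_df[tfidf_df[zentity].notnull()].index
    rows_by_make[zentity] = my_df.loc[rows_to_keep, :]

# Alternatively, rows that mention at least one of the makes.
any_make = my_df.loc[tfidf_df[present].notnull().any(axis=1)]
```

Whether a dictionary of per-make frames or a single combined mask is more useful depends on what the rows are needed for afterwards.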

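Attribute access to the columns of tfidf_df with a variable like zentity seems to always fail, even when the variable stores a string. I am not sure why at the moment (I think it has to do with how column names are handled when the DataFrame is created and how attribute access on class objects generally works), but I will look it up.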
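
A likely explanation is ordinary Python attribute semantics: in `tfidf_df.zentity` the name after the dot is taken literally, so pandas looks for a column called "zentity" and the value stored in the variable is never consulted. The small sketch below, using a hypothetical two-column frame, illustrates the difference; `getattr` is shown only to make the point, and bracket indexing remains the more conventional choice.

```python
import pandas as pd

# Hypothetical two-column frame; None becomes NaN in a float column.
tfidf_df = pd.DataFrame({"toyota": [0.6, None], "ford": [None, 0.7]})
zentity = "toyota"

# Bracket access uses the value of the variable, i.e. the "toyota" column.
by_bracket = tfidf_df[zentity]

# getattr performs attribute lookup with a computed name.
by_getattr = getattr(tfidf_df, zentity)

# Dot access looks for a column literally named "zentity".
try:
    tfidf_df.zentity
    failed = False
except AttributeError:
    failed = True

print(by_bracket.equals(by_getattr), failed)
```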


tf-idf

python-3.x

scikit-learn

sklearn-pandas