Drop most frequent words from dataset

Question

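I am trying to process text that contains a lot of repetition. I have previously used scikit-learn's tf-idf vectorizer, which has a parameter max_df=0.5. This means that a word is ignored if it appears in more than 50% of the input documents. I would like to know whether there is a similar feature in plain Python, Doc2Vec, or NLTK: I want to remove words that occur in more than 50% of the dataset, without vectorizing them.

For example, I would like to turn a dataframe like this: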

0 | This is new: A puppy ate cheese! See?
1 | This is new: A cat was found. See?
2 | This is new: Problems arise. See?

into output like this:

0 | puppy ate cheese
1 | cat was found
2 | problems arise

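I have already done the lowercasing and stop-word removal; now I just want to remove the most frequent words. I would also like to store this information, because new input may arrive later, and I want to remove from it the same frequent words that I found in the original corpus.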

Answer 1

You could do

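import nltk

# `text` is assumed to be a single string holding the whole corpus.
# word_tokenize needs NLTK's tokenizer models to be downloaded beforehand.
allWords = nltk.tokenize.word_tokenize(text)
# Count how often each lowercased token occurs in the corpus.
allWordDist = nltk.FreqDist(w.lower() for w in allWords)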

followed by

# most_common(10) returns a list of (word, count) tuples, so keep only the words.
mostCommon = [word for word, _ in allWordDist.most_common(10)]

in your preprocessing step?

If you look at

allWordDist.items()

I think you will find everything you need.
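Note that FreqDist counts total occurrences, whereas max_df in the question is about the share of documents a word appears in. If that is what you are after, here is a minimal sketch in plain Python, not a definitive implementation. The helper names (tokenize, find_frequent_words, remove_words) and the simple regex tokenizer are made up for illustration; swap in whatever tokenization you already use.

```python
import json
import re
from collections import Counter


def tokenize(doc):
    # Simplifying assumption: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", doc.lower())


def find_frequent_words(docs, max_df=0.5):
    # Count each word at most once per document (document frequency).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    threshold = max_df * len(docs)
    # Keep the words that occur in more than max_df of the documents.
    return {word for word, count in doc_freq.items() if count > threshold}


def remove_words(doc, words_to_drop):
    # Rebuild the document without the words to drop.
    return " ".join(t for t in tokenize(doc) if t not in words_to_drop)


docs = [
    "This is new: A puppy ate cheese! See?",
    "This is new: A cat was found. See?",
    "This is new: Problems arise. See?",
]

frequent = find_frequent_words(docs, max_df=0.5)
cleaned = [remove_words(d, frequent) for d in docs]

# The set can be serialized (here to a JSON string) and reloaded later,
# so the same words can be stripped from new input.
saved = json.dumps(sorted(frequent))
restored = set(json.loads(saved))
new_cleaned = remove_words("This is new: A dog barked. See?", restored)
```

With the three example rows, cleaned comes out as the output asked for in the question. Writing the saved string to a file would let you reuse the same word list on later batches.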

Tags: python, text, nltk, scikit-learn, doc2vec