Eliminate the words that appear in x% of your dataframe

Question

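I have a pandas dataframe with thousands of rows. Each row has a column labelled text_processed that contains text. These texts can be long, with hundreds of words per row/text. Now I want to eliminate the words that appear in 95% of the rows. What I am doing is joining all the texts into one big string and tokenizing that string. That gives me the vocabulary of all the words in all the texts. I now want to get the number of rows each word appears in. A simple (and slow) way is to iterate over every word, check whether each row of the column contains it, and sum the result to get the number of rows the word appears in. This can be seen here: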

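wordcountPerRow = []
for word in all_words:
    # skip punctuation and a few very common tokens
    if word in [':', '•', 'and', '%', '\\', '|', '-', 'no', 'of', ')', '(', '[', ']', '--', '/', '*', ';', '`', '``', '\'\'', '+']:
        continue
    try:
        # str.contains treats the word as a regex and scans the whole column once per word
        wordcountPerRow.append([word, df_note['text_processed'].str.contains(word).sum()])
    except Exception:
        # words that are not valid regular expressions end up here
        print(word)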

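Once I have all the sums, I just compute len(df)*0.95 and check whether a word's row count is >= 95%; if that is true (a boolean column), I remove the word. This process looks slow and computationally heavy. Is there any way to speed it up? Could I use a count vectorizer?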
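
One way to avoid scanning the whole column once per word is a single pass over the rows that counts each word at most once per row. A minimal sketch, assuming plain whitespace tokenization is acceptable; the helper names `document_frequency` and `drop_common_words` and the sample texts are made up for illustration:

```python
from collections import Counter


def document_frequency(texts):
    """Count, for each word, the number of texts that contain it at least once."""
    counts = Counter()
    for text in texts:
        # set() ensures a word is counted once per row, however often it repeats
        counts.update(set(text.split()))
    return counts


def drop_common_words(texts, threshold=0.95):
    """Remove words that occur in at least `threshold` of the texts."""
    doc_freq = document_frequency(texts)
    cutoff = len(texts) * threshold
    too_common = {word for word, n in doc_freq.items() if n >= cutoff}
    cleaned = [" ".join(w for w in text.split() if w not in too_common) for text in texts]
    return cleaned, too_common


# Hypothetical sample data standing in for df_note['text_processed'].tolist()
texts = [
    "patient has fever",
    "patient reports cough and fever",
    "patient denies pain",
    "patient has cough",
]
cleaned, too_common = drop_common_words(texts, threshold=0.95)
```

With a dataframe, the input would be `df_note['text_processed'].tolist()`, and the cleaned list can be assigned back to the column.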

Similar to: removing words that appear more than x% in a corpus Python

Answer 1

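Have you tried the pandas .str accessor?

We don't know what your strings look like, but since this is a dataframe I assume you can "slice" the strings with this accessor.

If you want to keep the first 10 characters from the left, you can use: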

left = df['Your column'].str[:10]

If you want to keep the last 10 characters, counting from the right, you can use:

right = df['Your column'].str[-10:]

This is just an example, but it could be a good starting point for solving your problem.

Answer 2

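It looks like you can use the count vectorizer with a slight twist. Because CountVectorizer counts the number of occurrences per document, we can simply apply a boolean mask (count_vector > 0): a word that appears one or more times in a document is masked to 1, and a word with a count of 0 stays 0 and contributes nothing to the sum. From there we can transpose, use the feature names as the index, and simply select the percentage interval we want.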

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# passing max_df=0.8, min_df=0.1 here would drop terms by document proportion directly
cv = CountVectorizer()
count_vector = cv.fit_transform(df_note['text_processed'].tolist())

# number of documents each word occurs in
doc_counts = np.asarray((count_vector > 0).sum(axis=0)).ravel()
word_document_count = pd.DataFrame({'Document Count': doc_counts},
                                   index=cv.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0

# keep words that occur in more than 20% and fewer than 80% of the documents
top_perc_num = len(df_note) * 0.8
bottom_perc_num = len(df_note) * 0.2
in_range = (word_document_count['Document Count'] < top_perc_num) & (word_document_count['Document Count'] > bottom_perc_num)
word_document_count_trunc = word_document_count[in_range]

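I believe this is a faster way to accomplish the task. My only complaint is that the numbers seem to differ slightly from the original approach. I tried a small reproducible example, and there the results were the same.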
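
One plausible source of the mismatch is that `str.contains` does substring (regex) matching, while `CountVectorizer` counts whole tokens, which by default are lowercased and at least two characters long. The sketch below imitates both with the standard library only; the sample texts are made up, and the token pattern is assumed to be the documented default of `CountVectorizer`:

```python
import re

# Hypothetical sample rows
texts = ["the pain is severe", "painful swelling", "no complaints"]
word = "pain"

# Substring/regex search, which is what Series.str.contains does per row
substring_rows = sum(bool(re.search(word, t)) for t in texts)

# Whole-token matching, imitating CountVectorizer's defaults:
# lowercase the text, then extract tokens of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
token_rows = sum(word in token_pattern.findall(t.lower()) for t in texts)

print(substring_rows, token_rows)
```

Here the substring search also counts the row containing "painful", so the two counts disagree; on real text such partial matches could account for the small differences.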

This worked for a vocabulary of 200k+ words and 90k+ rows.

python

text-processing

for-loop

pandas