Remove stop words from each tokenized row of a DataFrame

Question

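I am trying to remove stop words from each row of my DataFrame and put the result into a new column, S.

I tried the code below, but it doesn't seem to work...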

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

df['S'] = df.apply(lambda row: (word for word in row['remarks_tokenized'] if word.lower() not in stopwords), axis=1)

Answer 1

I tried this on a different corpus and it worked well.

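from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the set once; set membership tests are faster than list lookups.
stop_words = set(stopwords.words('english'))

def remove_stopwords(sentence):
    # Tokenize the raw text, then keep only tokens that are not stop words.
    word_tokens = word_tokenize(sentence)
    clean_tokens = [w for w in word_tokens if w not in stop_words]
    return clean_tokens

# 'remarks' holds the raw (untokenized) text of each row.
df['S'] = df['remarks'].apply(remove_stopwords)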

Output:

0     [microsoft, word, arma2011paper353, prediction...
1     [2504, 0478, matava, qxd, gulf, mexico, mature...
2     [lithospheric, structure, texas, gulf, mexico,...
4     [int, see, discussions, stats, author, profile...
5     [bltn9556, authors, thomas, r, taylor, shell, ...
7     [high, resolution, reservoir, characterization...
8     [untitled, journal, sedimentary, research, v, ...
9     [doi, j, epsl, www, elsevier, com, locate, eps...
10    [authors, dale, e, bird, department, geoscienc...
11    [spe, ms, spe, ms, taking, co2, enhanced, oil,...
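
If the column is already tokenized, as remarks_tokenized is in the question, there should be no need to tokenize again. A likely reason the original attempt looked broken is that the lambda wraps the comprehension in parentheses, which builds a generator expression, so each cell of S would hold a generator object rather than a list of words. Below is a minimal sketch of that fix, not a definitive implementation. The sample rows, the small stop_words set (a stand-in for set(stopwords.words('english')) so the example runs without NLTK data), and the helper name drop_stopwords are all assumptions made for illustration.

```python
import pandas as pd

# Stand-in stop-word set so the sketch runs without downloading NLTK data;
# in practice this would be set(stopwords.words('english')).
stop_words = {"the", "of", "in", "a", "and", "is"}

# Hypothetical sample data: each row already holds a list of tokens.
df = pd.DataFrame({
    "remarks_tokenized": [
        ["The", "structure", "of", "the", "Gulf"],
        ["Prediction", "is", "a", "hard", "problem"],
    ]
})

def drop_stopwords(tokens):
    # Square brackets build a real list; parentheses would build a lazy generator.
    # Lowercasing only for the comparison keeps the original casing in the output.
    return [word for word in tokens if word.lower() not in stop_words]

# Apply to the single column instead of iterating whole rows with axis=1.
df["S"] = df["remarks_tokenized"].apply(drop_stopwords)
print(df["S"].tolist())
```

Comparing word.lower() mirrors the question's code and matters because the NLTK English stop-word list is lowercase, so a capitalized "The" would otherwise be kept.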

python

nltk

stop-words

pandas