Remove stopwords from tokenization

Question

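I'm having trouble removing stopwords from tokenized text. I have already tokenized the sentences and stored the result with pandas in a column named 'tweets_tokenize'. With the first function below, the output has double brackets ([[ ]]), every row contains the same repeated result (see the images for details), and the stopwords are not removed. With the second function everything works as expected. Can someone explain why?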

from nltk.corpus import stopwords
stopwords_indonesia = stopwords.words('indonesian')

# First function
def stopwords_remover(words):
    words = df['tweets_tokenize']
    tweets_stopwords = []
    for word in words:
        if word not in stopwords_indonesia:
            tweets_stopwords.append(word)
    return tweets_stopwords

# Second function
def stopwords_remover(words):
    tweets_stopwords = []
    for word in words:
        if word not in stopwords_indonesia:
            tweets_stopwords.append(word)
    return tweets_stopwords

df['tweets_tokenize'].apply(stopwords_remover)
df.head()

Result using the first function.

Result using the second function.

Answer 1

First function:

The cause is this line:

words = df['tweets_tokenize']

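It overwrites the argument with the entire column, so words is a 'pandas.core.series.Series'. When you iterate over it in the loop, each word is a list (the full token list of one row), not a single token.

So when you append it on this line: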

tweets_stopwords.append(word)

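you are actually appending a list, not a word. A list never equals any stopword string, so the check word not in stopwords_indonesia is always true and nothing is filtered out. That is why you get lists wrapped in one big list. And because apply calls the function once per row, every row receives that same big list.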
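To make the behaviour of the first function concrete, here is a minimal sketch. The two sample rows and the three-word stopword list are made up for illustration and stand in for the real DataFrame and for stopwords.words('indonesian'); the function is renamed stopwords_remover_first only to keep it distinct.

```python
import pandas as pd

# Hypothetical stand-in for stopwords.words('indonesian') (also a list of str)
stopwords_indonesia = ["yang", "di", "dan"]

# Hypothetical sample data: one token list per row
df = pd.DataFrame({
    "tweets_tokenize": [
        ["saya", "makan", "di", "rumah"],
        ["buku", "yang", "bagus"],
    ]
})

def stopwords_remover_first(words):
    words = df["tweets_tokenize"]  # discards the argument, takes the whole column
    tweets_stopwords = []
    for word in words:  # each `word` is a whole token list, not a string
        # A list is never equal to a str, so this test is always True
        if word not in stopwords_indonesia:
            tweets_stopwords.append(word)
    return tweets_stopwords

result = df["tweets_tokenize"].apply(stopwords_remover_first)

print(type(next(iter(df["tweets_tokenize"]))))  # <class 'list'>
print(result.iloc[0])  # nested list containing every row, stopwords still present
print(result.iloc[0] == result.iloc[1])  # True: each row got the same big list
```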

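Second function:

Here apply passes each row's token list to the function separately, so each word in the loop is a plain string that can be compared against the stopwords.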
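The same check for the second function, again as a sketch with made-up sample rows and a short hypothetical stopword list in place of the NLTK one:

```python
import pandas as pd

# Hypothetical stand-in for stopwords.words('indonesian')
stopwords_indonesia = ["yang", "di", "dan"]

# Hypothetical sample data: one token list per row
df = pd.DataFrame({
    "tweets_tokenize": [
        ["saya", "makan", "di", "rumah"],
        ["buku", "yang", "bagus"],
    ]
})

seen_types = set()

def stopwords_remover(words):
    # `words` is the token list of a single row, supplied by apply
    tweets_stopwords = []
    for word in words:
        seen_types.add(type(word))  # record what the loop actually sees
        if word not in stopwords_indonesia:
            tweets_stopwords.append(word)
    return tweets_stopwords

result = df["tweets_tokenize"].apply(stopwords_remover)

print(seen_types)       # {<class 'str'>}
print(result.tolist())  # [['saya', 'makan', 'rumah'], ['buku', 'bagus']]
```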

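Summary:

When you use pandas' apply, have the function work on the single row value it receives, as the second function does, rather than reading the whole column inside the function, as the first function does.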
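One more point that may be worth checking: apply returns a new Series and does not change the DataFrame, so the result has to be assigned to a column to show up in df.head(). The sketch below assumes a new column named tweets_no_stopwords (a hypothetical name) and converts the stopwords to a set, which is an optional change that only speeds up the membership test. The data and stopwords are again made up.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('indonesian'))
stopwords_indonesia = {"yang", "di", "dan"}

# Hypothetical sample data: one token list per row
df = pd.DataFrame({
    "tweets_tokenize": [
        ["saya", "makan", "di", "rumah"],
        ["buku", "yang", "bagus"],
    ]
})

def stopwords_remover(words):
    # Keep only the tokens of this row that are not stopwords
    return [word for word in words if word not in stopwords_indonesia]

# Store the filtered tokens; the original column is left untouched
df["tweets_no_stopwords"] = df["tweets_tokenize"].apply(stopwords_remover)

print(df.head())
```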

python

nlp

nltk

pandas