Stopword segmentation

Question

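Hi everyone, I have a question about NLTK stopwords: when I loop over my text, the stopword check runs on individual letters instead of whole words. How can I change this behavior? An example: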

import pandas as pd
import nltk

stopword = nltk.corpus.stopwords.words('italian')
pd.set_option('display.max_colwidth', None)

df = pd.read_csv('esempioTweet.csv', sep =',')

def remove_stop(text):
    text = [word for word in text if word not in stopword]
    return text
df['Testo_no_stop'] = df['Testo_token'].apply(lambda x: remove_stop(x))
df.head()

Given that the previous column (Testo_token) looks like this:

[covid, calano, i, nuovi, contagi, e, tamponi]

I would like to get this output:

[covid, calano, nuovi, contagi, tamponi]

But my output is the following:

[v,d,n, ...]

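I can see that the stopword check is acting on single letters rather than whole words, but why? I am fairly sure my remove_stop function works correctly, so why do the stopwords behave the wrong way? Thanks for your patience.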

Answer 1

Your code uses for word in text, which, if text is a string, returns one letter at a time.
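A minimal sketch of that behavior, using made-up sample values rather than the asker's data: iterating over a str yields characters, while iterating over a list of strings yields whole words.

```python
# Sample values invented for illustration.
text_as_string = "covid calano"
text_as_list = ["covid", "calano"]

# Iterating over a str yields one character at a time.
chars = [word for word in text_as_string]

# Iterating over a list of str yields one element (word) at a time.
words = [word for word in text_as_list]

print(chars[:5])
print(words)
```

So if a cell of Testo_token is a string rather than a list, the comprehension in remove_stop compares single characters against the stopword list.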

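I simplified the code by removing pandas, since it isn't relevant, and changed remove_stop slightly to use word in text.split(). I would guess nltk has its own way to split text into words, and perhaps you should use that instead, since it may remove punctuation that split() does not.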
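To illustrate the punctuation point without depending on any downloaded nltk data, here is a rough sketch that swaps in a standard-library regex for the tokenizer. The STOPWORDS set is a hypothetical three-word stand-in for nltk.corpus.stopwords.words('italian'), and the lowercasing step is my own assumption, not something from the question.

```python
import re

# Hypothetical mini stopword set for illustration only; the real list would
# come from nltk.corpus.stopwords.words('italian').
STOPWORDS = {"e", "non", "i"}

def tokenize(text):
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # such as "," and "!" never ends up inside a token.
    # Lowercasing is an assumption: the stopword set here is lowercase.
    return re.findall(r"\w+", text.lower())

def remove_stop(text):
    return [word for word in tokenize(text) if word not in STOPWORDS]

phrase = "Oggi piove, e non esco!"
print(phrase.split())       # plain split() keeps "piove," and "esco!"
print(remove_stop(phrase))
```

With plain split(), "piove," and "esco!" would keep their punctuation; the regex version yields clean tokens before the stopword check. nltk's own tokenizers (for example nltk.tokenize.word_tokenize) are the more thorough option if you have the required data installed.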

import nltk

stopwords = nltk.corpus.stopwords.words('italian')

phrase = "oggi piove e non esco"

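def remove_stop(text):
    # No "global" needed: stopwords is only read here, never reassigned.
    # split() turns the string into words; iterating the str directly yields letters.
    text = [word for word in text.split() if word not in stopwords]
    return text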

res = remove_stop(phrase)
print( f"{res=}" )

Output:

res=['oggi', 'piove', 'esco']

By the way, I don't think you need the lambda; just use:

df['Testo_no_stop'] = df['Testo_token'].apply(remove_stop)
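
One caveat when applying this to the DataFrame: the question does not say how Testo_token was stored, but the letter-by-letter output suggests each cell came back from the CSV as text that merely looks like a list. Under that assumption, a hedged sketch of a variant that accepts either a real list or such a string (remove_stop_cell and STOPWORDS are hypothetical names, and the regex extraction is my substitution, not the asker's method):

```python
import re

# Hypothetical subset standing in for nltk.corpus.stopwords.words('italian').
STOPWORDS = {"i", "e"}

def remove_stop_cell(cell):
    # Assumption: a cell read back from CSV is a str such as
    # "[covid, calano, i]" or "['covid', 'calano', 'i']".
    if isinstance(cell, str):
        tokens = re.findall(r"\w+", cell)  # pull out the word-like runs
    else:
        tokens = list(cell)                # already a sequence of words
    return [word for word in tokens if word not in STOPWORDS]

as_text = "[covid, calano, i, nuovi, contagi, e, tamponi]"
as_list = ["covid", "calano", "i", "nuovi", "contagi", "e", "tamponi"]
print(remove_stop_cell(as_text))
print(remove_stop_cell(as_list))
```

A function like this can be passed to .apply in the same way as remove_stop.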

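Don't forget that you can add debugging output to functions like remove_stop(), which is a good reason to use a for loop rather than a comprehension that is hard to debug.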
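As a sketch of what that could look like, here is a for-loop version with print statements. The name remove_stop_debug and the two-word STOPWORDS set are hypothetical, chosen only to keep the example self-contained.

```python
# Hypothetical subset standing in for nltk.corpus.stopwords.words('italian').
STOPWORDS = {"e", "non"}

def remove_stop_debug(text):
    # Printing the input type immediately reveals a str where a list was expected.
    print(f"input type: {type(text).__name__}")
    kept = []
    for word in text:
        is_stop = word in STOPWORDS
        print(f"{word!r}: {'dropped' if is_stop else 'kept'}")
        if not is_stop:
            kept.append(word)
    return kept

from_list = remove_stop_debug(["oggi", "e", "esco"])
from_string = remove_stop_debug("e no")
print(from_list)
print(from_string)
```

Called with a string, the per-item lines show single characters being tested, which makes the cause of the problem visible at a glance.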

Likewise, you can print stopwords to check that it is a list. It is.

python

nlp

nltk

stop-words