Remove meaningless words from a dataframe column

Question

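A dataframe column contains sentences in which a few two- and three-letter words have no meaning. I want to find all such words in the column and then remove them from it. The dataframe, df: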

id      text
1       happy birthday syz
2       vz
3       have a good bne weekend

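I want to 1) find all words of three letters or fewer (this should return syz, vz, bne), and 2) remove those words. (Note that stop words have already been removed, so words like "a" and "the" no longer exist in the dataframe column; the dataframe above is just an example.)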

I tried the code below, but it doesn't work:

def word_length(text):
    words = []
    for word in text:
        if len(word) <= 3:
            words.append(word)
    return(words)

short_words = df['text'].apply(word_length).sum()

The output should be:

id      text
1       happy birthday 
2       
3       have good weekend

Answer 1

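Your function treats each cell as a sequence of words, but the actual data is a column of strings (sequences of characters), so the loop iterates over individual characters rather than words. You should also drop .sum(), since it is not needed here.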
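To make the first point concrete, here is a small sketch in plain Python (no pandas needed) of what the loop in the question actually iterates over. The sample string comes from the question's data; the variable names are only illustrative.

```python
text = "happy birthday syz"

# Iterating over a string yields single characters, not words,
# so len(word) <= 3 is true for every item, spaces included.
chars = [c for c in text]

# Splitting on whitespace first gives the words the question intended.
words = text.split()
short = [w for w in words if len(w) <= 3]

print(chars[:5])
print(words)
print(short)
```

This is why the original word_length returns every character of the string instead of the short words.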

Rewrite the function you apply as follows:

def filter_short_words(text):
    # Keep only words longer than three characters, rejoined with single spaces.
    return " ".join([w for w in text.split() if len(w) > 3])

This works.
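For completeness, here is a minimal end-to-end sketch covering both steps from the question: first listing the short words, then removing them. It assumes pandas is installed and that "short" means three characters or fewer. The helper find_short_words and the constant MAX_SHORT_LEN are illustrative names and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["happy birthday syz", "vz", "have a good bne weekend"],
})

MAX_SHORT_LEN = 3  # assumption: "short" means three characters or fewer


def find_short_words(text):
    """Return the words in one string that have MAX_SHORT_LEN characters or fewer."""
    return [w for w in text.split() if len(w) <= MAX_SHORT_LEN]


def filter_short_words(text):
    """Return the string with those short words dropped."""
    return " ".join(w for w in text.split() if len(w) > MAX_SHORT_LEN)


# Step 1: collect every short word across the column into one flat list.
short_words = [w for words in df["text"].apply(find_short_words) for w in words]

# Step 2: remove the short words from the column.
df["text"] = df["text"].apply(filter_short_words)

print(short_words)
print(df)
```

Because the sample row still contains "a", it shows up among the short words here; on data where stop words have already been removed, only tokens such as syz, vz, and bne would be listed. Note that a length cutoff also drops legitimate short words, so the threshold may need adjusting for real data.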

Tags: python, text-processing, nlp