Delete string elements in a Python dataframe according to element length

Question

I have a pandas dataframe with 13 columns and 60,000 rows, in which one column named "Text" (dtype object) contains fairly long text cells:

    Text    ID  AI  BI  GH  JB  EQ  HE  EN  MA  WE  WR
2585    obstetric gynaecologicaladmissions owing abor...    2585    0   0   0   0   0   1   0   0   0   0
507     graphic illustration process flow help organiz...   507     0   0   0   0   0   0   0   0   1   0

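In some rows, some words are stuck together (for example, gynaecologicaladmissions in the first dataframe row). To get rid of these, I would like to delete all such cases across the whole dataset. My idea is to remove, for each row of the "Text" column, every word longer than 13 characters.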

I tried this line of code:

res.loc[res['Text'].str.len() < 13]

But the result only gave me two empty rows.

How can I fix this?

Answer 1

Let's take a sample dataframe:

df

    text
0   obstetric gynaecologicaladmissions owing
1   graphic illustration process flow help
2   process flow help
3   illustrationprocess flow

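Note that str.len() on the column measures the length of the whole cell, not of each word, so the filter in the question selects rows rather than words. Because you need to check word length, you have to split each string on a delimiter (a space in this case), loop over the resulting list, and keep only the words whose length is <= 13. To process each list, you can use apply: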

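def func(x):
    # x is the list of words produced by str.split() for one row
    res = list()
    for word in x:
        # keep only words of at most 13 characters
        if len(word) <= 13:
            res.append(word)
    return " ".join(res)

# split each cell on whitespace, then filter and re-join the words
df['text'] = df['text'].str.split().apply(func)
df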
    
     text
0   obstetric owing
1   graphic illustration process flow help
2   process flow help
3   flow
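
For reference, the same idea can be written more compactly. The following is only a minimal sketch under a few assumptions: the sample dataframe is rebuilt by hand from the example above, the names MAX_LEN and drop_long_words are made up for illustration, and the column is assumed to contain no missing values (a NaN cell would need to be handled separately). It also shows why the row-level filter from the question comes back empty on this data.

```python
import pandas as pd

# Sample data rebuilt from the example above (a stand-in for the real dataset).
df = pd.DataFrame({
    "text": [
        "obstetric gynaecologicaladmissions owing",
        "graphic illustration process flow help",
        "process flow help",
        "illustrationprocess flow",
    ]
})

MAX_LEN = 13  # hypothetical name for the word-length threshold

# str.len() counts the characters of the whole cell, not of each word,
# so a boolean mask built from it keeps or drops entire rows.
cell_lengths = df["text"].str.len()
rows_kept_by_question_filter = df.loc[cell_lengths < MAX_LEN]

def drop_long_words(text, max_len=MAX_LEN):
    """Keep only the whitespace-separated words of at most max_len characters."""
    return " ".join(word for word in text.split() if len(word) <= max_len)

# Apply the word-level filter to every cell of the column.
cleaned = df["text"].apply(drop_long_words)
print(cleaned.tolist())
```

Every cell here is longer than 13 characters as a whole, so rows_kept_by_question_filter is empty, while cleaned holds the same strings as the output shown above.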

python

nlp

string-length

dataframe