Cleaning a column in Pandas - Gibberish

Question

Goal: clean the OneCol column in my pandas DataFrame. What I did: I imported NLTK and ran this code:

import nltk    
import collections
from nltk.corpus import words

for value in df_US['OneCol']:
    if value in words.words():
        df_US['Result']=df_US['Result'].iloc.append(value)

I also tried this:

#df_US['Result'] = df_US[['OneCol']].apply(lambda x: x.words.words())

Neither worked!

My data looks like this:

Thanks, I appreciate any suggestions you can give me.

Answer 1

Let's define a test DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1,2,3, 4],
    'Country': [2,2,2,2],
    'Q1': ['', '', 'I like to CODE', ''],
    'Q2': ['Good', 'xxxx', '', 'some gibberish text: jgsldkgnlk'],
    'OneCol': ['good', 'xxxx', 'i like to code', 'some gibberish text: jgsldkgnlk']
})
df

This produces a four-row DataFrame with the columns ID, Country, Q1, Q2, and OneCol.

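import nltk
from nltk.corpus import words

# Download the word list before using it (only needed once)
nltk.download('words')

# Build the vocabulary once; calling words.words() for every row is slow
english_words = set(words.words())

# Keep only the words that appear in the NLTK word list
df['Result'] = df['OneCol'].apply(lambda x: ' '.join(set(x.split()) & english_words))

df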

In the result, the unknown words are removed from each row.
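One side effect of the set intersection is that word order and repeated words are not kept, because Python sets are unordered. If that matters, a list comprehension is one alternative. The following is only a minimal sketch: the helper name `keep_known_words` and the tiny `vocab` set are hypothetical stand-ins for illustration, and in practice `vocab` would be `set(words.words())`.

```python
def keep_known_words(text, vocab):
    """Return text with only the words found in vocab, in their original order."""
    return ' '.join(word for word in text.split() if word in vocab)

# Hypothetical stand-in vocabulary; with NLTK this would be set(words.words())
vocab = {'i', 'like', 'to', 'code', 'good', 'some', 'text'}

cleaned = keep_known_words('i like to code', vocab)
print(cleaned)
```

Assuming the same DataFrame as above, it could be applied with `df['OneCol'].apply(lambda x: keep_known_words(x, vocab))`.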

If you want to drop any field that contains at least one unknown word, you can use the following:

# Keep the field only if every word is in the NLTK word list; otherwise set it to None
english_words = set(words.words())
df['Result'] = df['OneCol'].apply(lambda x: x if set(x.split()) <= english_words else None)

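In the result, any field that contains an unknown word is dropped (set to None).

Note that this logic does not account for punctuation. Because the text is split on whitespace only, a word with punctuation attached to it (such as "text:") will not be recognized.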
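One possible workaround, sketched here under assumptions rather than taken from the answer, is to extract alphabetic tokens with a regular expression before the lookup. The function name `known_tokens` and the small `vocab` set are hypothetical. With NLTK, `vocab` would be built from `words.words()`, and lowercasing its entries is an extra assumption because the text is lowercased here.

```python
import re

def known_tokens(text, vocab):
    """Lowercase the text, extract runs of letters, and keep those found in vocab."""
    # Punctuation and digits act as separators, so "text:" yields "text"
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token in vocab]

# Hypothetical stand-in vocabulary for illustration only
vocab = {'some', 'text', 'good'}

print(known_tokens('some gibberish text: jgsldkgnlk', vocab))
```

This simple pattern also splits contractions such as "don't" into two tokens, so it may need adjusting for real survey text.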

