Replace row value with NaN if particular word is present - Python

Question

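I am cleaning a data frame and want to check whether any of its values contains a word from a given list. If it does, the value should be replaced with an NA value. For example,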

My data frame looks like this.

p['title']

1                                             Forest
2                                            [VIDEO_TITLE]
3                                            [VIDEO_TITLE]
4                                            [VIDEO_TITLE]
5                                [${title}url=${videourl}]


p.dtypes
title    object
dtype: object

and

c= ('${title}', '[VIDEO_TITLE]')

Since rows 2, 3, 4 and 5 contain words from c, I want them replaced with NA values.

I am trying the following,

p['title'].replace('|'.join(c),np.NAN,regex=True).fillna('NA')

This runs without error, but the output is identical to the input. Nothing changes at all.

My next attempt was,

p['title'].apply(lambda x: 'NA' if any(s in x for s in c) else x)

which raises an error,

TypeError: argument of type 'float' is not iterable

I have tried a few other things, but none of them worked. I am not sure what I am doing wrong.

My ideal output is,

p['title']

1     Forest
2        NA
3        NA
4        NA
5        NA

Can anyone help me solve this?

Answer 1

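You can use loc to set them to 'NA'. Since your values are sometimes inside a list, you first need to extract them from the list. The second line extracts the first string from the list (if the value is a list). The third line checks for a match.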

c = ('${title}', 'VIDEO_TITLE')
string_check = p['title'].map(lambda x: x if not isinstance(x, list) else x[0])
string_check = string_check.map(lambda s: isinstance(s, str) and any(c_str in s for c_str in c))
p.loc[string_check, 'title'] = 'NA'

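Depending on what you are doing, you may want to set the values to numpy.nan instead of the string 'NA'. This is the usual way pandas represents missing values, and a lot of functionality is built around it.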
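As a rough sketch of the numpy.nan variant suggested above: the sample data and the helper name contains_placeholder are made up for illustration, and the data assumes (as this answer does) that some values are one-element lists and some are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question: a plain string,
# a one-element list, a templated string, and a missing value (float NaN).
p = pd.DataFrame({
    'title': ['Forest', ['VIDEO_TITLE'], '${title}url=${videourl}', np.nan]
})
c = ('${title}', 'VIDEO_TITLE')

def contains_placeholder(value):
    """Return True if value (or the first item of a list) contains a word in c."""
    if isinstance(value, list):
        value = value[0] if value else None
    # Non-strings such as NaN cannot contain a word, so they never match.
    return isinstance(value, str) and any(word in value for word in c)

matches = p['title'].map(contains_placeholder)

# Series.mask replaces entries where the condition is True (with NaN by default).
p['title'] = p['title'].mask(matches)
```

With real missing values in place, methods such as isna(), dropna() and fillna() work on the column directly.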

Answer 2

>>> import pandas as pd
>>> import numpy as np

>>> df = pd.DataFrame({'A' : ('a','b','c', 'd', 'a', 'b', 'c')})
>>> restricted = ['a', 'b', 'c']
>>> df[df['A'].isin(restricted)] = np.nan
>>> df
     A
0  NaN
1  NaN
2  NaN
3    d
4  NaN
5  NaN
6  NaN
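
Note that isin only matches whole values, so it would not catch a title that merely contains one of the words. A possible substring-based variant is sketched below. It assumes the column holds plain strings (not lists) plus possibly missing values, and the sample data is made up for illustration. The words are passed through re.escape because characters such as '$', '{' and '[' have special meaning in regular expressions.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample data: plain strings plus one missing value.
p = pd.DataFrame({
    'title': ['Forest', '[VIDEO_TITLE]', '[${title}url=${videourl}]', np.nan]
})
c = ('${title}', '[VIDEO_TITLE]')

# re.escape makes '$', '{', '[' and similar characters match literally.
pattern = '|'.join(re.escape(word) for word in c)

# na=False treats missing values as "no match" instead of propagating NaN.
matches = p['title'].str.contains(pattern, regex=True, na=False)
p.loc[matches, 'title'] = np.nan
```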

