Replace missing and inconsistent values in Python

Given the following example:

import pandas as pd
df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})

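I need a Python function that takes two columns and compares them:

  1. If one column has a missing value, fill it from the other column.

  2. If both values are 'NULL', keep 'NULL'.

  3. If the values differ (are inconsistent), replace both values with 'NULL'.

  4. Return a single attribute (one column).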

The data after running the function should look like this.

This is what I have done so far; I need help implementing step 3.

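def myFunction(firAttribute, secAttribute):
    # select the two columns; .ix has been removed from pandas, so use .loc
    x = df.loc[:, [firAttribute, secAttribute]].copy()
    # fill each column's missing values from the other column
    x['new'] = x[firAttribute].fillna(x[secAttribute])
    x['new2'] = x[secAttribute].fillna(x[firAttribute])
    x['new'] = x['new'].fillna(x['new2'])
    return x['new']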

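You can first replace 'null' with NaN, then use combine_first to fill the NaN values in each column from the other column, and finally use boolean indexing to select the rows where the two columns differ and set them to NaN: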

import pandas as pd
import numpy as np

df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})
print(df)
  Column A Column B
0     null      100
1       20     null
2       30       30
3       40       50
4     null     null

# replace 'null' with NaN
df = df.replace("null", np.nan)
print(df)
   Column A  Column B
0       NaN       100
1        20       NaN
2        30        30
3        40        50
4       NaN       NaN

# fill NaN in each column from the other column
df['Column A'] = df['Column A'].combine_first(df['Column B'])
df['Column B'] = df['Column B'].combine_first(df['Column A'])
print(df)
   Column A  Column B
0       100       100
1        20        20
2        30        30
3        40        50
4       NaN       NaN

# replace inconsistent values with NaN
df[df['Column A'] != df['Column B']] = np.nan
print(df)
   Column A  Column B
0       100       100
1        20        20
2        30        30
3       NaN       NaN
4       NaN       NaN
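
Since the question asks for a function that returns a single column (step 4), the steps above could be wrapped up roughly as follows. This is only a sketch: the name merge_columns and the null_token parameter are made up for illustration, and it assumes missing values are marked with the string 'null' as in the example. It uses Series.where to turn the placeholder into NaN, combine_first to fill gaps, and Series.mask to blank out conflicting rows.

```python
import numpy as np
import pandas as pd


def merge_columns(df, first, second, null_token="null"):
    """Combine two columns into one Series (hypothetical helper)."""
    # Turn the placeholder string into a real missing value (NaN).
    a = df[first].where(df[first] != null_token)
    b = df[second].where(df[second] != null_token)
    # Step 1: fill gaps in the first column from the second column.
    merged = a.combine_first(b)
    # Step 3: a conflict is a row where both columns hold a value
    # and the two values differ; such rows become NaN.
    conflict = a.notna() & b.notna() & (a != b)
    # Step 2 needs no extra work: rows missing in both columns stay NaN.
    return merged.mask(conflict)


df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})
result = merge_columns(df, 'Column A', 'Column B')
print(result)
```

With the sample data, rows 0 to 2 keep 100, 20 and 30, while rows 3 (40 vs 50) and 4 (both missing) come out as NaN. If the literal string 'NULL' is needed in the output instead of NaN, calling result.fillna('NULL') afterwards should do it.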