Replace missing and inconsistent values in Python
Given the following example:
import pandas as pd
df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})
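I need a Python function that takes two columns and compares them:
1. If one column has a missing value, fill it from the other column.
2. If both values are 'NULL', keep 'NULL'.
3. If the values differ (are inconsistent), replace both values with 'NULL'.
The function should return a single attribute (column).
After running the function, the data should look as follows.
This is what I have so far; I need help implementing step 3.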
def myFunction(firAttribute, secAttribute):
    # df.ix has been removed from pandas; .loc selects the same two columns
    x = df.loc[:, [firAttribute, secAttribute]].copy()
    x['new'] = x[firAttribute].fillna(x[secAttribute])
    x['new2'] = x[secAttribute].fillna(x[firAttribute])
    x['new'] = x['new'].fillna(x['new2'])
    return x['new']
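You can first replace null with NaN, then use combine_first to fill the NaN values in each column from the other column, and finally use boolean indexing to find the rows where the column values differ and set them to NaN: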
import pandas as pd
import numpy as np

df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})
print(df)
  Column A Column B
0     null      100
1       20     null
2       30       30
3       40       50
4     null     null

# replace the 'null' placeholder with NaN
df = df.replace("null", np.nan)
print(df)
  Column A Column B
0      NaN      100
1       20      NaN
2       30       30
3       40       50
4      NaN      NaN

# fill the NaN values in each column from the other column
df['Column A'] = df['Column A'].combine_first(df['Column B'])
df['Column B'] = df['Column B'].combine_first(df['Column A'])
print(df)
  Column A Column B
0      100      100
1       20       20
2       30       30
3       40       50
4      NaN      NaN

# replace inconsistent values with NaN
# (NaN != NaN is True, so rows where both values are missing stay NaN)
df[df['Column A'] != df['Column B']] = np.nan
print(df)
  Column A Column B
0      100      100
1       20       20
2       30       30
3      NaN      NaN
4      NaN      NaN
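
Since the question asks for a function that returns a single column, the same three steps could be wrapped up roughly as follows. This is only a sketch: the name reconcile_columns and its missing parameter are made up for illustration, and it assumes the placeholder is the lowercase string 'null' used in the example data. It also leaves the original DataFrame unmodified, which differs from the in-place steps shown above.

```python
import numpy as np
import pandas as pd


def reconcile_columns(frame, first, second, missing="null"):
    """Hypothetical helper: merge two columns into one Series.

    A value missing in one column is taken from the other; rows where both
    columns are missing, or where the two values disagree, become NaN.
    """
    # Turn the placeholder string into a real missing value (NaN).
    a = frame[first].mask(frame[first] == missing)
    b = frame[second].mask(frame[second] == missing)

    # Take the value from `a` where present, otherwise from `b`.
    merged = a.combine_first(b)

    # A conflict is a row where both values are present but different.
    conflict = a.notna() & b.notna() & (a != b)

    # Blank out the conflicting rows.
    return merged.mask(conflict)


df = pd.DataFrame({'Column A': ['null', 20, 30, 40, 'null'],
                   'Column B': [100, 'null', 30, 50, 'null']})
result = reconcile_columns(df, 'Column A', 'Column B')
print(result)
```

If the string placeholder is wanted back in the result instead of NaN, result.fillna('null') should restore it.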