Find difference between 2 columns with Nulls using pandas
I want to find the difference between two int-type columns in a pandas DataFrame. I am using Python 2.7. The columns are as follows:
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED
0                 15               NaN
1                 20               NaN
2                  7               NaN
3                  7               NaN
4                  7               NaN
Now I want to subtract QUANTITY_SHIPPED from INVOICED_QUANTITY, so I do the following:
>>> df['Diff'] = df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED']
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN   NaN
1                 20               NaN   NaN
2                  7               NaN   NaN
3                  7               NaN   NaN
4                  7               NaN   NaN
How do I handle the NaN values? I would like to get the following result, because I want NaN to be treated as 0 (zero):
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7
I do not want to do df.fillna(0). For a sum, I would try something like the following, and it works, but I cannot find an equivalent for the difference:
>>> df['Sum'] = df[['INVOICED_QUANTITY', 'QUANTITY_SHIPPED']].sum(axis=1)
>>> df
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff  Sum
0                 15               NaN   NaN   15
1                 20               NaN   NaN   20
2                  7               NaN   NaN    7
3                  7               NaN   NaN    7
4                  7               NaN   NaN    7
I think simply filling the NaN values with 0 will help you.
df['Diff'] = df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED'].fillna(0)
Out[153]:
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7
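As a minimal, self-contained sketch of this approach: calling fillna(0) on a single column returns a new Series, so the column stored in the DataFrame keeps its NaN values, which may address the concern about not wanting df.fillna(0) on the whole frame. The sample values below are made up for illustration, and one shipped quantity is filled in for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the question); one shipped value is present.
df = pd.DataFrame({
    "INVOICED_QUANTITY": [15, 20, 7],
    "QUANTITY_SHIPPED": [np.nan, 5.0, np.nan],
})

# fillna(0) returns a new Series; the stored QUANTITY_SHIPPED column is unchanged.
df["Diff"] = df["INVOICED_QUANTITY"] - df["QUANTITY_SHIPPED"].fillna(0)

print(df["Diff"].tolist())                  # [15.0, 15.0, 7.0]
print(df["QUANTITY_SHIPPED"].isna().sum())  # 2
```

The result is a float column because the shipped column contains NaN and is therefore stored as float.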
You can use the sub method to perform the subtraction; this method allows NaN values to be treated as a specified value:
df['Diff'] = df['INVOICED_QUANTITY'].sub(df['QUANTITY_SHIPPED'], fill_value=0)
This produces:
   INVOICED_QUANTITY  QUANTITY_SHIPPED  Diff
0                 15               NaN    15
1                 20               NaN    20
2                  7               NaN     7
3                  7               NaN     7
4                  7               NaN     7
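A small sketch of how fill_value behaves across the possible missing/present combinations may help here. The data is hypothetical. Per the pandas documentation for Series.sub, fill_value is substituted when only one side is missing; if both sides are missing, the result stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering: both present, right missing, left missing, both missing.
df = pd.DataFrame({
    "INVOICED_QUANTITY": [15, 20, np.nan, np.nan],
    "QUANTITY_SHIPPED": [5, np.nan, 4, np.nan],
})

# A NaN on one side is treated as 0; a NaN on both sides stays NaN.
df["Diff"] = df["INVOICED_QUANTITY"].sub(df["QUANTITY_SHIPPED"], fill_value=0)

print(df["Diff"].tolist())  # [10.0, 20.0, -4.0, nan]
```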
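Another neat way is to fill in the missing values in the column (creating a copy of the column) and subtract as usual.
The two approaches are almost the same, although sub is a little more efficient because it does not need to produce a copy of the column up front; it just fills the missing values "on the fly":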
In [46]: %timeit df['INVOICED_QUANTITY'] - df['QUANTITY_SHIPPED'].fillna(0)
10000 loops, best of 3: 144 µs per loop
In [47]: %timeit df['INVOICED_QUANTITY'].sub(df['QUANTITY_SHIPPED'], fill_value=0)
10000 loops, best of 3: 81.7 µs per loop
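To repeat this comparison outside IPython, something like the following sketch with the standard timeit module could be used. The frame size, random data, and helper names (with_fillna, with_sub) are assumptions made for illustration. Timings depend on the pandas version, data size, and hardware, so the figures above may not reproduce and no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame so the timing is less dominated by call overhead.
rng = np.random.default_rng(0)
n = 100_000
shipped = rng.integers(0, 50, size=n).astype(float)
shipped[rng.random(n) < 0.5] = np.nan  # make roughly half the values missing
df = pd.DataFrame({
    "INVOICED_QUANTITY": rng.integers(1, 100, size=n),
    "QUANTITY_SHIPPED": shipped,
})

def with_fillna():
    return df["INVOICED_QUANTITY"] - df["QUANTITY_SHIPPED"].fillna(0)

def with_sub():
    return df["INVOICED_QUANTITY"].sub(df["QUANTITY_SHIPPED"], fill_value=0)

# The two approaches agree here because INVOICED_QUANTITY has no missing values.
assert with_fillna().equals(with_sub())

for func in (with_fillna, with_sub):
    seconds = timeit.timeit(func, number=100)
    print(f"{func.__name__}: {seconds / 100 * 1e6:.1f} us per call")
```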
The solutions suggested by @Alex Riley and @Jianxun Li do not work as intended when both columns are NaN. You can slightly revise @Jianxun Li's suggested solution (at the cost of some computation time) to address this:
df['Diff'] = df['INVOICED_QUANTITY'].fillna(0) - df['QUANTITY_SHIPPED'].fillna(0)
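A short sketch with made-up values shows the effect of filling both sides: the row where both columns are NaN yields 0 instead of NaN. Since no NaN remains in the result, it can also be cast back to an integer type, which is an optional extra step assumed here to match the integer display the question asked for.

```python
import numpy as np
import pandas as pd

# Hypothetical data: right missing, left missing, both missing.
df = pd.DataFrame({
    "INVOICED_QUANTITY": [15, np.nan, np.nan],
    "QUANTITY_SHIPPED": [np.nan, 4, np.nan],
})

# Fill both sides so that every NaN counts as 0, including the both-missing row.
df["Diff"] = df["INVOICED_QUANTITY"].fillna(0) - df["QUANTITY_SHIPPED"].fillna(0)
print(df["Diff"].tolist())  # [15.0, -4.0, 0.0]

# Optional: the result has no NaN left, so an integer cast is safe.
diff_as_int = df["Diff"].astype(int)
print(diff_as_int.tolist())  # [15, -4, 0]
```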
I am posting several options for comparison:
import numpy as np
import pandas as pd

data = {'C1': [1, 2, np.nan, np.nan],
        'C2': [6, np.nan, 4, np.nan],
        }
df = pd.DataFrame(data)
df['Dif'] = df.C1 - df.C2
df['Dif2'] = df['C1'].sub(df['C2'], fill_value=0)
df['Dif3'] = df['C1'] - df['C2'].fillna(0)
df['Dif4'] = df['C1'].fillna(0) - df['C2']
df['Dif5'] = df['C1'].fillna(0) - df['C2'].fillna(0)
print(df)
This produces:
     C1    C2    Dif   Dif2   Dif3   Dif4   Dif5
0 1.000 6.000 -5.000 -5.000 -5.000 -5.000 -5.000
1 2.000   NaN    NaN  2.000  2.000    NaN  2.000
2   NaN 4.000    NaN -4.000    NaN -4.000 -4.000
3   NaN   NaN    NaN    NaN    NaN    NaN  0.000