How to collapse columns in pandas on null values?
Suppose I have the following DataFrame:
pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
'override1': ["b", np.nan, "b", np.nan, np.nan],
'override2': ["c", np.nan, np.nan, "c", np.nan]})
  col1 override1 override2
0    a         b         c
1    a       NaN       NaN
2  NaN         b       NaN
3  NaN       NaN         c
4  NaN       NaN       NaN
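Is there a way to collapse the 3 columns into a single column, where override2 overrides override1, which in turn overrides col1, but where a NaN leaves the value before it in place? Also, I'm mainly looking for a way that doesn't require creating an extra column. I'm really looking for a built-in pandas solution.
This is the output I'm looking for: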
  collapsed
0         c
1         a
2         b
3         c
4       NaN
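Here is one way (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only runs on older versions):
df.lookup(df.index, df.notna().cumsum(axis=1).idxmax(axis=1))
# array(['c', 'a', 'b', 'c', nan], dtype=object)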
Or, equivalently, work with the underlying NumPy array and replace idxmax with ndarray.argmax:
df.values[df.index, df.notna().cumsum(1).values.argmax(1)]
# array(['c', 'a', 'b', 'c', nan], dtype=object)
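To see why the cumulative-count trick picks the right column, the sketch below spells out the intermediate steps on the same frame. It avoids DataFrame.lookup (removed in pandas 2.0) and indexes rows with np.arange instead of df.index, which is an assumption made here so that it also works when the index is not 0..n-1. The names counts, last_valid_pos and collapsed are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})

# Running count of non-null values, left to right, per row.
counts = df.notna().cumsum(axis=1)

# argmax returns the FIRST position of the maximum, i.e. the column
# where the running count stops growing: the last non-null column.
last_valid_pos = counts.to_numpy().argmax(axis=1)

# Pick one value per row using positional (row, column) pairs.
values = df.to_numpy()
collapsed = pd.Series(values[np.arange(len(df)), last_valid_pos],
                      index=df.index, name='collapsed')
print(collapsed)
```

For a row that is entirely NaN the counts are all zero, so position 0 is selected and the result for that row is still NaN.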
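import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})
print(df)

# Concatenate the columns as strings, treating NaN as an empty string.
# Note: this joins every non-null value in a row (row 0 becomes 'abc' and the
# all-NaN row becomes ''), so it matches the requested output only when each
# row has at most one non-null value.
collapsed = df['col1'].fillna('') + df['override1'].fillna('') + df['override2'].fillna('')
print(collapsed)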
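A simple solution is to forward-fill along the columns and select the last column. This was mentioned in the comments. (The axis is passed by keyword because ffill no longer accepts positional arguments in pandas 2.0 and later.)
df.ffill(axis=1).iloc[:, -1].to_frame(name='collapsed')
  collapsed
0         c
1         a
2         b
3         c
4       NaN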
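Because the forward fill runs left to right, the column order is what defines the override priority. The sketch below wraps that idea in a small helper; collapse_by_priority and its parameters are hypothetical names introduced here, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})


def collapse_by_priority(frame, columns, name='collapsed'):
    """Collapse `columns` into one Series; later columns override earlier ones.

    Hypothetical helper: selects the columns in the given order,
    forward-fills across them, and keeps the right-most column.
    """
    ordered = frame[list(columns)]
    return ordered.ffill(axis=1).iloc[:, -1].rename(name)


# override2 wins over override1, which wins over col1.
result = collapse_by_priority(df, ['col1', 'override1', 'override2'])

# Reversing the list reverses the priority: col1 now wins.
reversed_result = collapse_by_priority(df, ['override2', 'override1', 'col1'])
print(result)
print(reversed_result)
```

Selecting the columns explicitly also keeps unrelated columns in a wider frame out of the fill.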
If you are interested in performance, we can use a modified version of Divakar's justify function:
pd.DataFrame({'collapsed': justify(
    df.values, invalid_val=np.nan, axis=1, side='right')[:, -1]
})
  collapsed
0         c
1         a
2         b
3         c
4       NaN
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing and used to pad the output
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notna(a)  # modified for strings
    else:
        mask = a != invalid_val
    # Sorting a boolean mask pushes the True entries to the end of each row/column.
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
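To make the behaviour of justify concrete, here is a small self-contained run on a numeric array with 0 as the invalid value. The function body repeats the one above without its docstring; the sample array is made up for illustration.

```python
import numpy as np
import pandas as pd


def justify(a, invalid_val=0, axis=1, side='left'):
    # Same logic as the function above, repeated so the example runs alone.
    if invalid_val is np.nan:
        mask = pd.notna(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=a.dtype)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out


# Made-up sample: zeros stand for missing entries.
sample = np.array([[1, 0, 2],
                   [0, 0, 3],
                   [4, 5, 0]])

right = justify(sample, invalid_val=0, axis=1, side='right')
left = justify(sample, invalid_val=0, axis=1, side='left')
print(right)
print(left)
```

With side='right' the valid values of each row are packed against the right edge in their original order, so the last column holds the last valid value of every row. That is why the answer takes [:, -1] of the justified array.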
With performance in mind, here is a NumPy approach:
In [106]: idx = df.shape[1] - 1 - df.notnull().to_numpy()[:,::-1].argmax(1)
In [107]: pd.Series(df.to_numpy()[np.arange(len(df)),idx])
Out[107]:
0 c
1 a
2 b
3 c
4 NaN
dtype: object
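The index arithmetic in that one-liner is easier to follow when split into steps. The sketch below does so on the same frame; the intermediate names valid and first_valid_from_right are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["a", "a", np.nan, np.nan, np.nan],
                   'override1': ["b", np.nan, "b", np.nan, np.nan],
                   'override2': ["c", np.nan, np.nan, "c", np.nan]})

# Boolean mask of non-null cells.
valid = df.notna().to_numpy()

# Reverse the columns; argmax then finds the first True from the right.
first_valid_from_right = valid[:, ::-1].argmax(axis=1)

# Convert the position in the reversed array back to an original column position.
idx = df.shape[1] - 1 - first_valid_from_right

collapsed = pd.Series(df.to_numpy()[np.arange(len(df)), idx])
print(collapsed)
```

For an all-NaN row the mask is all False, argmax returns 0, and idx points at the last column, whose value is NaN, so that row still comes out as NaN.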
Not about performance, but nice and elegant (-:
df.stack().groupby(level=0).last().reindex(df.index)
0 c
1 a
2 b
3 c
4 NaN
dtype: object
Using ffill:
df.ffill(axis=1).iloc[:, -1]