Backfill a column based on a group in pandas
I am working with the following DataFrame:
import pandas as pd

df = pd.DataFrame({"id": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   "date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2019, 12, 30)],
                   "other_col": ['NA', 'NA', 'A444', 'NA', 'NA', 'B666', 'NA', 'C999'],
                   "other_col_1": [123, 123, 'NA', 0.765, 0.555, 'NA', 0.324, 'NA']})
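What I want to achieve: backfill the "other_col" entries within each corresponding group, and drop the rows where "other_col_1" is 'NA'.
I tried groupby with bfill() and ffill(), e.g. df.groupby('id')['other_col'].bfill(), but it does not work.
The resulting DataFrame should look like this: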
df_new = pd.DataFrame({"id": ['A', 'A', 'B', 'B', 'C'],
                       "date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2016, 12, 30)],
                       "other_col": ['A444', 'A444', 'B666', 'B666', 'C999'],
                       "other_col_1": [123, 123, 0.765, 0.555, 0.324]})
IIUC, you can do this:
out = (
    df.replace('NA', pd.NA)                                  # ensure real NA
      .assign(other_col=lambda d: d['other_col'].bfill())    # backfill other_col
      .dropna(subset=['other_col_1'])                        # drop rows based on other_col_1
)
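The replace step is what makes the fill possible: the string 'NA' is ordinary data to pandas, which is likely why the original groupby attempt appeared to do nothing. A minimal sketch on a small Series (the names `s` and `fixed` are only illustrative):

```python
import pandas as pd

s = pd.Series(['NA', 'NA', 'A444'])

# The literal string 'NA' is not a missing value, so bfill has nothing to fill
print(s.isna().sum())          # 0
print(s.bfill().tolist())      # ['NA', 'NA', 'A444']

# Once the strings are replaced with a real missing value, bfill works
fixed = s.replace('NA', pd.NA)
print(fixed.isna().sum())      # 2
print(fixed.bfill().tolist())  # ['A444', 'A444', 'A444']
```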
Alternatively, bfill per group:
(df.replace('NA', pd.NA)
   .assign(other_col=lambda d: d.groupby(d['id'].str.replace(r'\d+', '', regex=True))
                                ['other_col'].bfill())
   .dropna(subset=['other_col_1'])
)
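Output (the ids A1, A2, ... here appear to come from an earlier version of the sample data; the regex above strips those digits to form the groups):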
   id       date other_col other_col_1
0  A1 2015-12-30      A444         123
1  A2 2016-12-30      A444         123
3  B1 2015-12-30      B666       0.765
4  B2 2016-12-30      B666       0.555
6  C1 2016-12-30      C999       0.324
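With the ids exactly as given in the question ('A', 'B', 'C', no numeric suffix), the regex should not be needed and grouping on the column directly is probably enough. A sketch under that assumption, rebuilding the question's data with `pd.to_datetime` for brevity; the final `reset_index(drop=True)` is an optional addition so the index matches `df_new`:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    "date": pd.to_datetime(['2015-12-30', '2016-12-30', '2018-12-30',
                            '2015-12-30', '2016-12-30', '2018-12-30',
                            '2016-12-30', '2019-12-30']),
    "other_col": ['NA', 'NA', 'A444', 'NA', 'NA', 'B666', 'NA', 'C999'],
    "other_col_1": [123, 123, 'NA', 0.765, 0.555, 'NA', 0.324, 'NA'],
})

out = (
    df.replace('NA', pd.NA)                       # turn 'NA' strings into real missing values
      .assign(other_col=lambda d: d.groupby('id')['other_col'].bfill())  # fill within each id
      .dropna(subset=['other_col_1'])             # drop rows lacking other_col_1
      .reset_index(drop=True)                     # optional: renumber rows 0..4
)
print(out)
```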
First, replace 'NA' with real NaN values, then bfill:
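import numpy as np
df = df.replace('NA', np.nan)  # turn the 'NA' strings into real NaN
# bfill every column, then keep the rows whose other_col_1 was originally present
df = df.bfill()[df['other_col_1'].notna()]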
Output:
>>> df
  id       date other_col  other_col_1
0  A 2015-12-30      A444      123.000
1  A 2016-12-30      A444      123.000
3  B 2015-12-30      B666        0.765
4  B 2016-12-30      B666        0.555
6  C 2016-12-30      C999        0.324
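One caveat worth checking on real data: a frame-wide bfill matches the per-group result here only because every group ends with a filled "other_col" value. The sketch below uses a small hypothetical frame (`demo` is not from the question) in which group A has no value of its own, to show how the two approaches can differ:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group A never has a value in other_col
demo = pd.DataFrame({
    "id": ['A', 'A', 'B', 'B'],
    "other_col": [np.nan, np.nan, np.nan, 'B666'],
})

# Frame-wide bfill pulls B's value up into group A
global_fill = demo['other_col'].bfill()
print(global_fill.tolist())         # ['B666', 'B666', 'B666', 'B666']

# Per-group bfill leaves group A missing and fills only within B
group_fill = demo.groupby('id')['other_col'].bfill()
print(group_fill.isna().tolist())   # [True, True, False, False]
```

If a group can lack a trailing value, the grouped version is likely the safer choice.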