Add missing rows in pandas DataFrame
I have a DataFrame that looks like this:
df = pd.DataFrame.from_dict({'id': [1, 2, 1, 1, 2, 3],
'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
'time': ['10:00:00', '12:00:00', '10:00:05', '10:00:07', '12:00:03', '15:00:00']} )
What I want to get is:
out = pd.DataFrame.from_dict({'id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
'reward': [0.1, 0, 0, 0, 0, 0.15, 0.0, 0.05, 0.25, 0.0, 0.0, 0.4, 0.45],
'time': ['10:00:00', '10:00:01', '10:00:02', '10:00:03', '10:00:04', '10:00:05', '10:00:06', '10:00:07',
'12:00:00', '12:00:01', '12:00:02', '12:00:03', '15:00:00']} )
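In short: for each id, add the missing time rows with a reward of 0. How can I do this? I wrote something with a loop, but it would be very slow for my use case, which has several million rows.
Here is one approach using groupby.apply, in which we use date_range to generate the missing times for each id, then merge the result back into df and fill in the missing values of the other columns. Note that it takes the first and last time of each group as the range bounds, so it assumes the times are already sorted within each id: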
df['time'] = pd.to_datetime(df['time'])
out = df.merge(df.groupby('id')['time'].apply(lambda x: pd.date_range(x.iat[0], x.iat[-1], freq='S')).explode(), how='right')
out['id'] = out['id'].ffill().astype(int)
out['reward'] = out['reward'].fillna(0)
Output:
id reward time
0 1 0.10 2022-04-23 10:00:00
1 1 0.00 2022-04-23 10:00:01
2 1 0.00 2022-04-23 10:00:02
3 1 0.00 2022-04-23 10:00:03
4 1 0.00 2022-04-23 10:00:04
5 1 0.15 2022-04-23 10:00:05
6 1 0.00 2022-04-23 10:00:06
7 1 0.05 2022-04-23 10:00:07
8 2 0.25 2022-04-23 12:00:00
9 2 0.00 2022-04-23 12:00:01
10 2 0.00 2022-04-23 12:00:02
11 2 0.40 2022-04-23 12:00:03
12 3 0.45 2022-04-23 15:00:00
One option is to use the complete function from pyjanitor to abstract the process:
# dev version has some performance improvements
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
# convert the time strings to datetime64
df['time'] = pd.to_datetime(df['time'])
# create mapping for expanded time
new_time = {'time' : lambda df: pd.date_range(df.min(), df.max(), freq='1S')}
# generate expanded rows
df.complete(new_time, by = 'id', fill_value = 0)
id reward time
0 1 0.10 2022-04-24 10:00:00
1 1 0.00 2022-04-24 10:00:01
2 1 0.00 2022-04-24 10:00:02
3 1 0.00 2022-04-24 10:00:03
4 1 0.00 2022-04-24 10:00:04
5 1 0.15 2022-04-24 10:00:05
6 1 0.00 2022-04-24 10:00:06
7 1 0.05 2022-04-24 10:00:07
8 2 0.25 2022-04-24 12:00:00
9 2 0.00 2022-04-24 12:00:01
10 2 0.00 2022-04-24 12:00:02
11 2 0.40 2022-04-24 12:00:03
12 3 0.45 2022-04-24 15:00:00
Another option, which may be faster, is to use a combination of groupby, explode and merge:
# get the min and max dates
temp = df.groupby('id').time.agg(['min', 'max'])
# generate list of dates
outcome = [pd.date_range(start, end, freq='1S')
           for start, end in
           zip(temp['min'], temp['max'])]
outcome = pd.Series(outcome, index = temp.index).rename('time').explode()
# merge back to original df
(pd
 .merge(outcome, df, on = ['id', 'time'], how = 'outer')
 .fillna({'reward':0})
 .loc[:, df.columns]
)
id reward time
0 1 0.10 2022-04-24 10:00:00
1 1 0.00 2022-04-24 10:00:01
2 1 0.00 2022-04-24 10:00:02
3 1 0.00 2022-04-24 10:00:03
4 1 0.00 2022-04-24 10:00:04
5 1 0.15 2022-04-24 10:00:05
6 1 0.00 2022-04-24 10:00:06
7 1 0.05 2022-04-24 10:00:07
8 2 0.25 2022-04-24 12:00:00
9 2 0.00 2022-04-24 12:00:01
10 2 0.00 2022-04-24 12:00:02
11 2 0.40 2022-04-24 12:00:03
12 3 0.45 2022-04-24 15:00:00
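For anyone who wants to run the min/max idea end to end, here is a minimal self-contained sketch of it. It is a variation on the approach above rather than the exact code: it builds the per-id time grid with pd.concat instead of explode, and does a left merge from the grid. The names bounds and grid, the explicit format='%H:%M:%S' (which puts every time on the placeholder date 1900-01-01), and the use of pd.Timedelta(seconds=1) as the frequency are my own assumptions, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 1, 1, 2, 3],
    'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
    'time': ['10:00:00', '12:00:00', '10:00:05', '10:00:07', '12:00:03', '15:00:00'],
})

# Assumption: parse with an explicit format, so every row lands on 1900-01-01.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')

# First and last timestamp per id (no assumption about row order).
bounds = df.groupby('id')['time'].agg(['min', 'max'])

# One small frame per id holding every second between its bounds.
grid = pd.concat(
    [pd.DataFrame({'id': key,
                   'time': pd.date_range(start, end, freq=pd.Timedelta(seconds=1))})
     for key, start, end in bounds.itertuples()],
    ignore_index=True,
)

# Left merge keeps every grid row; seconds with no original row get reward 0.
out = grid.merge(df, on=['id', 'time'], how='left')
out['reward'] = out['reward'].fillna(0)
out = out[['id', 'reward', 'time']]
print(out)
```

Because the merge starts from the grid, id never becomes NaN, so no ffill or integer cast is needed afterwards. The Python-level loop runs once per id, not once per row, so its cost depends on the number of distinct ids.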