使用正向填充 Python 以 1 分钟的频率分组并重新采样
Groupby and resample at 1min frequency using forward fill in Python
我想使用前向填充 ffill
以 1min
的频率对 data
列重新采样,同时按 id
列对 df
进行分组:
df
:
id timestamp data
1 1 2017-01-02 13:14:53.040 10.0
2 1 2017-01-02 16:04:43.240 11.0
...
4 2 2017-01-02 15:22:06.540 1.0
5 2 2017-01-03 13:55:34.240 2.0
...
预期输出:
id timestamp data
1 1 2017-01-02 13:14:53.040 10.0
2017-01-02 13:14:54.040 10.0
2017-01-02 13:14:55.040 10.0
2017-01-02 13:14:56.040 10.0
...
2 1 2017-01-02 16:04:43.240 11.0
2017-01-02 16:04:44.240 11.0
2017-01-02 16:04:45.240 11.0
2017-01-02 16:04:46.240 11.0
...
4 2 2017-01-02 15:22:06.540 1.0
2017-01-02 15:22:07.540 1.0
2017-01-02 15:22:08.540 1.0
2017-01-02 15:22:09.540 1.0
...
5 2 2017-01-03 13:55:34.240 2.0
2017-01-03 13:55:35.240 2.0
2017-01-03 13:55:36.240 2.0
2017-01-03 13:55:37.240 2.0
...
类似 但我试过:
df.set_index('timestamp').groupby('id').resample('1min').asfreq().drop(['id'], 1).reset_index()
和 data
列仅返回 NaN
个值:
id timestamp data
0 1 2017-01-02 13:14:53.040 NaN
1 1 2017-01-02 13:14:54.040 NaN
2 1 2017-01-02 13:14:55.040 NaN
3 1 2017-01-02 13:14:56.040 NaN
4 1 2017-01-02 13:14:57.040 NaN
... ... ... ...
编辑:
df
timestamp
的第二行从 2017-01-02 12:04:43.240
更改为 2017-01-02 16:04:43.240
,即属于相同 id
的行应该排序。
- 我误以为预期输出中的 min 是 second,但@jezrael 的回答是正确的。
使用自定义函数定义需要新行的数量 Timedelta
和 date_range
and DataFrame.reindex
:
def f(x):
new = x.index[0] + pd.Timedelta(5, unit='Min')
r = pd.date_range(x.index[0], new, freq='Min')
return x.reindex(r, method='ffill')
df = (df.reset_index()
.set_index('timestamp')
.groupby(['index','id'], sort=False)['data']
.apply(f)
.reset_index(level=0, drop=True)
.rename_axis(['id','timestamp'])
.reset_index()
)
print (df)
id timestamp data
0 1 2017-01-02 13:14:53.040 10.0
1 1 2017-01-02 13:15:53.040 10.0
2 1 2017-01-02 13:16:53.040 10.0
3 1 2017-01-02 13:17:53.040 10.0
4 1 2017-01-02 13:18:53.040 10.0
5 1 2017-01-02 13:19:53.040 10.0
6 1 2017-01-02 12:04:43.240 11.0
7 1 2017-01-02 12:05:43.240 11.0
8 1 2017-01-02 12:06:43.240 11.0
9 1 2017-01-02 12:07:43.240 11.0
10 1 2017-01-02 12:08:43.240 11.0
11 1 2017-01-02 12:09:43.240 11.0
12 2 2017-01-02 15:22:06.540 1.0
13 2 2017-01-02 15:23:06.540 1.0
14 2 2017-01-02 15:24:06.540 1.0
15 2 2017-01-02 15:25:06.540 1.0
16 2 2017-01-02 15:26:06.540 1.0
17 2 2017-01-02 15:27:06.540 1.0
18 2 2017-01-03 13:55:34.240 2.0
19 2 2017-01-03 13:56:34.240 2.0
20 2 2017-01-03 13:57:34.240 2.0
21 2 2017-01-03 13:58:34.240 2.0
22 2 2017-01-03 13:59:34.240 2.0
23 2 2017-01-03 14:00:34.240 2.0
因为如果使用 ffill
输出不同:
df = df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill()
print (df)
id timestamp
1 2017-01-02 12:04:00 NaN
2017-01-02 12:05:00 11.0
2017-01-02 12:06:00 11.0
2017-01-02 12:07:00 11.0
2017-01-02 12:08:00 11.0
2 2017-01-03 13:51:00 1.0
2017-01-03 13:52:00 1.0
2017-01-03 13:53:00 1.0
2017-01-03 13:54:00 1.0
2017-01-03 13:55:00 1.0
Name: data, Length: 1425, dtype: float64
我想使用前向填充 ffill
以 1min
的频率对 data
列重新采样,同时按 id
列对 df
进行分组:
df
:
id timestamp data
1 1 2017-01-02 13:14:53.040 10.0
2 1 2017-01-02 16:04:43.240 11.0
...
4 2 2017-01-02 15:22:06.540 1.0
5 2 2017-01-03 13:55:34.240 2.0
...
预期输出:
id timestamp data
1 1 2017-01-02 13:14:53.040 10.0
2017-01-02 13:14:54.040 10.0
2017-01-02 13:14:55.040 10.0
2017-01-02 13:14:56.040 10.0
...
2 1 2017-01-02 16:04:43.240 11.0
2017-01-02 16:04:44.240 11.0
2017-01-02 16:04:45.240 11.0
2017-01-02 16:04:46.240 11.0
...
4 2 2017-01-02 15:22:06.540 1.0
2017-01-02 15:22:07.540 1.0
2017-01-02 15:22:08.540 1.0
2017-01-02 15:22:09.540 1.0
...
5 2 2017-01-03 13:55:34.240 2.0
2017-01-03 13:55:35.240 2.0
2017-01-03 13:55:36.240 2.0
2017-01-03 13:55:37.240 2.0
...
类似
df.set_index('timestamp').groupby('id').resample('1min').asfreq().drop(['id'], 1).reset_index()
和 data
列仅返回 NaN
个值:
id timestamp data
0 1 2017-01-02 13:14:53.040 NaN
1 1 2017-01-02 13:14:54.040 NaN
2 1 2017-01-02 13:14:55.040 NaN
3 1 2017-01-02 13:14:56.040 NaN
4 1 2017-01-02 13:14:57.040 NaN
... ... ... ...
编辑:
df
timestamp
的第二行从2017-01-02 12:04:43.240
更改为2017-01-02 16:04:43.240
,即属于相同id
的行应该排序。- 我误以为预期输出中的 min 是 second,但@jezrael 的回答是正确的。
使用自定义函数定义需要新行的数量 Timedelta
和 date_range
and DataFrame.reindex
:
def f(x):
new = x.index[0] + pd.Timedelta(5, unit='Min')
r = pd.date_range(x.index[0], new, freq='Min')
return x.reindex(r, method='ffill')
df = (df.reset_index()
.set_index('timestamp')
.groupby(['index','id'], sort=False)['data']
.apply(f)
.reset_index(level=0, drop=True)
.rename_axis(['id','timestamp'])
.reset_index()
)
print (df)
id timestamp data
0 1 2017-01-02 13:14:53.040 10.0
1 1 2017-01-02 13:15:53.040 10.0
2 1 2017-01-02 13:16:53.040 10.0
3 1 2017-01-02 13:17:53.040 10.0
4 1 2017-01-02 13:18:53.040 10.0
5 1 2017-01-02 13:19:53.040 10.0
6 1 2017-01-02 12:04:43.240 11.0
7 1 2017-01-02 12:05:43.240 11.0
8 1 2017-01-02 12:06:43.240 11.0
9 1 2017-01-02 12:07:43.240 11.0
10 1 2017-01-02 12:08:43.240 11.0
11 1 2017-01-02 12:09:43.240 11.0
12 2 2017-01-02 15:22:06.540 1.0
13 2 2017-01-02 15:23:06.540 1.0
14 2 2017-01-02 15:24:06.540 1.0
15 2 2017-01-02 15:25:06.540 1.0
16 2 2017-01-02 15:26:06.540 1.0
17 2 2017-01-02 15:27:06.540 1.0
18 2 2017-01-03 13:55:34.240 2.0
19 2 2017-01-03 13:56:34.240 2.0
20 2 2017-01-03 13:57:34.240 2.0
21 2 2017-01-03 13:58:34.240 2.0
22 2 2017-01-03 13:59:34.240 2.0
23 2 2017-01-03 14:00:34.240 2.0
因为如果使用 ffill
输出不同:
df = df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill()
print (df)
id timestamp
1 2017-01-02 12:04:00 NaN
2017-01-02 12:05:00 11.0
2017-01-02 12:06:00 11.0
2017-01-02 12:07:00 11.0
2017-01-02 12:08:00 11.0
2 2017-01-03 13:51:00 1.0
2017-01-03 13:52:00 1.0
2017-01-03 13:53:00 1.0
2017-01-03 13:54:00 1.0
2017-01-03 13:55:00 1.0
Name: data, Length: 1425, dtype: float64