Python Pandas: resample based on just one of the columns
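I have the following data, and I am resampling it to find out how many bikes arrive at each station every 15 minutes. However, my code is also aggregating my stations, and I only want to aggregate over the variable `dtm_end_trip`.
Sample data: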
id_trip | dtm_start_trip | dtm_end_trip | start_station | end_station |
---|---|---|---|---|
1 | 2018-10-01 10:15:00 | 2018-10-01 10:17:00 | A | B |
2 | 2018-10-01 10:17:00 | 2018-10-01 10:18:00 | B | A |
... | ... | ... | ... | ... |
999999 | 2021-12-31 23:58:00 | 2022-01-01 00:22:00 | C | A |
1000000 | 2021-12-31 23:59:00 | 2022-01-01 00:29:00 | A | D |
Code I tried:
df2 = df.groupby(['end_station', 'dtm_end_trip']).size().to_frame(name='count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)
df2 = df2.set_index('dtm_end_trip')
df2 = df2.resample('15T').count()
The output I get:
dtm_end_trip | end_station | count |
---|---|---|
2018-10-01 00:15:00 | 2 | 2 |
2018-10-01 00:30:00 | 0 | 0 |
2018-10-01 00:45:00 | 1 | 1 |
2018-10-01 01:00:00 | 2 | 2 |
2018-10-01 01:15:00 | 1 | 1 |
Desired output:
dtm_end_trip | end_station | count |
---|---|---|
2018-10-01 00:15:00 | A | 2 |
2018-10-01 00:15:00 | B | 0 |
2018-10-01 00:15:00 | C | 1 |
2018-10-01 00:15:00 | D | 2 |
2018-10-01 00:30:00 | A | 3 |
2018-10-01 00:30:00 | B | 2 |
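Note that the count column in the table above was filled with random numbers; its only purpose is to illustrate the structure of the desired output.
You can use `pd.Grouper` like this: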
out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),
    'end_station',
]).size()
>>> out
dtm_end_trip         end_station
2018-10-01 10:15:00  A              1
                     B              1
2022-01-01 00:15:00  A              1
                     D              1
dtype: int64
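To try this end to end, here is a minimal self-contained sketch. It assumes a DataFrame built from only the four rows visible in the sample data (the elided rows are left out) and that `dtm_end_trip` has already been parsed as datetimes, which `pd.Grouper(freq=...)` requires.

```python
import pandas as pd

# Only the four rows shown in the sample data; the "..." rows are omitted.
df = pd.DataFrame({
    "id_trip": [1, 2, 999999, 1000000],
    "dtm_start_trip": pd.to_datetime([
        "2018-10-01 10:15:00", "2018-10-01 10:17:00",
        "2021-12-31 23:58:00", "2021-12-31 23:59:00",
    ]),
    "dtm_end_trip": pd.to_datetime([
        "2018-10-01 10:17:00", "2018-10-01 10:18:00",
        "2022-01-01 00:22:00", "2022-01-01 00:29:00",
    ]),
    "start_station": ["A", "B", "C", "A"],
    "end_station": ["B", "A", "A", "D"],
})

# Bin dtm_end_trip into 15-minute windows while keeping end_station
# as a separate grouping key, then count the rows in each group.
out = df.groupby([
    pd.Grouper(freq="15min", key="dtm_end_trip"),
    "end_station",
]).size()

print(out)
```

The station stays a grouping key rather than a column to be aggregated, which is what distinguishes this from calling `resample(...).count()` on the whole frame.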
The result is a `Series`, but you can easily convert it to a `DataFrame` with the same headers as your desired output:
>>> out.to_frame('count').reset_index()
         dtm_end_trip end_station  count
0 2018-10-01 10:15:00           A      1
1 2018-10-01 10:15:00           B      1
2 2022-01-01 00:15:00           A      1
3 2022-01-01 00:15:00           D      1
Note: this is the result for the four rows in the sample input data.