Python Pandas: resample based on just one of the columns

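I have the following data, which I am resampling to find out how many bikes arrive at each station every 15 minutes. However, my code also aggregates over the stations, and I only want to aggregate over the variable "dtm_end_trip".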

Sample data:

id_trip  dtm_start_trip       dtm_end_trip         start_station  end_station
1        2018-10-01 10:15:00  2018-10-01 10:17:00  A              B
2        2018-10-01 10:17:00  2018-10-01 10:18:00  B              A
...      ...                  ...                  ...            ...
999999   2021-12-31 23:58:00  2022-01-01 00:22:00  C              A
1000000  2021-12-31 23:59:00  2022-01-01 00:29:00  A              D

Code I tried:

df2 = df.groupby(['end_station', 'dtm_end_trip']).size().to_frame(name='count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)

df2= df2.set_index('dtm_end_trip')

df2 = df2.resample('15T').count()

Output I get:

dtm_end_trip         end_station  count
2018-10-01 00:15:00  2            2
2018-10-01 00:30:00  0            0
2018-10-01 00:45:00  1            1
2018-10-01 01:00:00  2            2
2018-10-01 01:15:00  1            1

Desired output:

dtm_end_trip         end_station  count
2018-10-01 00:15:00  A            2
2018-10-01 00:15:00  B            0
2018-10-01 00:15:00  C            1
2018-10-01 00:15:00  D            2
2018-10-01 00:30:00  A            3
2018-10-01 00:30:00  B            2

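Note that the count column in the table above is filled with random numbers; its only purpose is to illustrate the structure of the desired output.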

You can use pd.Grouper like this:

out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),
    'end_station',
]).size()

>>> out
dtm_end_trip         end_station
2018-10-01 10:15:00  A              1
                     B              1
2022-01-01 00:15:00  A              1
                     D              1
dtype: int64

The result is a Series, but you can easily convert it to a DataFrame with the same column headers as your desired output:

>>> out.to_frame('count').reset_index()
         dtm_end_trip end_station  count
0 2018-10-01 10:15:00           A      1
1 2018-10-01 10:15:00           B      1
2 2022-01-01 00:15:00           A      1
3 2022-01-01 00:15:00           D      1

Note: this is the result for the four rows shown in the sample input data.