Pandas 聚合 - 如何根据事件日期查找最近的事件
Pandas Aggregation - how to find nearest event based on event dates
我有一长串事件在 pandas 数据框中,每个事件都有事件开始和结束日期。如何使用 pandas.groupby()
在每个城市和场地组合中找到“最近的”活动?
如果没有新的即将发生的事件,最近的事件可能是过去的事件。即在这种情况下数据框中的最新事件,恰好是过去发生的。
如果有多个即将发生的事件,未来最接近的事件将被视为最近的事件。
我试过 groupby.agg("max")
如下,但这总是会给出最远的未来事件:
dfp.groupby(['CITY', 'VENUE'], as_index=False).agg({"EVENT_START" : "max", "EVENT_END": "max"})
正在寻找一种方法来获取时间上最近的未来事件(如果没有未来事件,则获取过去最近的事件)。
示例数据:
EVENT_START,EVENT_END,Event Description,City,Venue
2/5/2016,3/12/2016,event 1,Chicago,Art Institute of Chicago
11/2/2014,12/2/2014,event 2,Los Angelos,Party Haus
1/25/2018,1/31/2018,event 3,Long Beach,Precious Lamb
8/26/2018,8/31/2018,event 4,West Columbia,New Brookland Tavern
11/20/2017,12/17/2017,event 5,Paris,Orsay Museum
6/26/2018,7/9/2018,event 6,Lahaina,Bamboo Fresh
4/3/2010,5/2/2010,event 7,Mitchell,The Corn Palace
9/21/2015,10/18/2015,event 8,San Diego,San Diego Zoo
1/4/2014,1/15/2014,event 9,Portland,Doug Fir Lounge
9/21/2019,9/26/2019,event 10,St. Louis,Krispy Kreme
3/15/2015,2/13/2018,event 11,Corvallis,The Beanery
9/23/2005,10/2/2005,event 12,San Jose,Winchester Mystery House
12/11/2019,12/14/2019,event 13,Chicago,Art Institute of Chicago
6/1/2013,6/26/2013,event 14,Los Angelos,Party Haus
7/10/2020,9/4/2020,event 15,Long Beach,Precious Lamb
10/18/2020,11/26/2020,event 16,West Columbia,New Brookland Tavern
5/14/2004,5/16/2004,event 17,Paris,Orsay Museum
11/16/2020,11/20/2020,event 18,Lahaina,Bamboo Fresh
7/19/2020,10/22/2020,event 19,Mitchell,The Corn Palace
11/1/2017,11/30/2017,event 20,San Diego,San Diego Zoo
7/31/2015,8/1/2015,event 21,Portland,Doug Fir Lounge
10/12/2012,10/20/2012,event 22,St. Louis,Krispy Kreme
2/28/2003,3/13/2003,event 23,Corvallis,The Beanery
9/16/2019,9/20/2019,event 24,San Jose,Winchester Mystery House
3/1/2022,4/1/2022,event 25,Chicago,Art Institute of Chicago
2/19/2009,2/25/2009,event 26,Los Angelos,Party Haus
4/16/2015,5/8/2015,event 27,Long Beach,Precious Lamb
9/7/2016,9/11/2016,event 28,West Columbia,New Brookland Tavern
8/4/2001,8/26/2001,event 29,Paris,Orsay Museum
4/27/2017,6/11/2017,event 30,Lahaina,Bamboo Fresh
5/21/2011,6/19/2011,event 31,Mitchell,The Corn Palace
6/3/2020,8/10/2020,event 32,San Diego,San Diego Zoo
10/29/2012,11/15/2012,event 33,Portland,Doug Fir Lounge
9/1/2027,10/15/2027,event 34,St. Louis,Krispy Kreme
6/23/2017,6/25/2017,event 35,Corvallis,The Beanery
4/25/2007,5/26/2007,event 36,San Jose,Winchester Mystery House
5/30/2003,7/1/2003,event 37,Chicago,Art Institute of Chicago
3/14/2008,4/12/2008,event 38,Los Angelos,Party Haus
5/29/2017,7/27/2017,event 39,Long Beach,Precious Lamb
1/31/2015,3/7/2015,event 40,West Columbia,New Brookland Tavern
4/1/2017,4/21/2017,event 41,Paris,Orsay Museum
12/29/2003,1/31/2004,event 42,Lahaina,Bamboo Fresh
7/3/2021,7/17/2021,event 43,Mitchell,The Corn Palace
9/1/2004,4/30/2005,event 44,San Diego,San Diego Zoo
10/14/2006,10/27/2006,event 45,Portland,Doug Fir Lounge
7/18/2017,7/19/2017,event 46,St. Louis,Krispy Kreme
6/1/2006,6/1/2006,event 47,Corvallis,The Beanery
10/1/2012,11/4/2012,event 48,San Jose,Winchester Mystery House
9/5/2011,9/19/2011,event 49,Chicago,Art Institute of Chicago
5/28/2020,6/2/2020,event 50,Los Angelos,Party Haus
3/1/2023,4/1/2023,event 51,Chicago,Art Institute of Chicago
结果应该是:
3/1/2022 4/1/2022 event 25 Chicago Art Institute of Chicago
5/28/2020 6/2/2020 event 50 Los Angelos Party Haus
7/10/2020 9/4/2020 event 15 Long Beach Precious Lamb
10/18/2020 11/26/2020 event 16 West Columbia New Brookland Tavern
11/20/2017 12/17/2017 event 5 Paris Orsay Museum
11/16/2020 11/20/2020 event 18 Lahaina Bamboo Fresh
7/3/2021 7/17/2021 event 43 Mitchell The Corn Palace
6/3/2020 8/10/2020 event 32 San Diego San Diego Zoo
7/31/2015 8/1/2015 event 21 Portland Doug Fir Lounge
9/1/2027 10/15/2027 event 34 St. Louis Krispy Kreme
6/23/2017 6/25/2017 event 35 Corvallis The Beanery
9/16/2019 9/20/2019 event 24 San Jose Winchester Mystery House
IIUC,
df.loc[df.assign(diff = df['EVENT_START'] - pd.Timestamp('now')).groupby(['City', 'Venue'])['diff'].idxmax()]
输出:
EVENT_START EVENT_END Event Description City Venue
24 2022-03-01 2022-04-01 event 25 Chicago Art Institute of Chicago
34 2017-06-23 2017-06-25 event 35 Corvallis The Beanery
17 2020-11-16 2020-11-20 event 18 Lahaina Bamboo Fresh
14 2020-07-10 2020-09-04 event 15 Long Beach Precious Lamb
49 2020-05-28 2020-06-02 event 50 Los Angelos Party Haus
42 2021-07-03 2021-07-17 event 43 Mitchell The Corn Palace
4 2017-11-20 2017-12-17 event 5 Paris Orsay Museum
20 2015-07-31 2015-08-01 event 21 Portland Doug Fir Lounge
31 2020-06-03 2020-08-10 event 32 San Diego San Diego Zoo
23 2019-09-16 2019-09-20 event 24 San Jose Winchester Mystery House
33 2027-09-01 2027-10-15 event 34 St. Louis Krispy Kreme
15 2020-10-18 2020-11-26 event 16 West Columbia New Brookland Tavern
好吧,如果事件已经过去,您可能想改用 EVENT_END:
df['diff'] = np.where(df['EVENT_START'] - pd.Timestamp('now') > pd.Timedelta(days=0),
df['EVENT_START'] - pd.Timestamp('now'),
df['EVENT_END'] - pd.Timestamp('now'))
df.loc[df.groupby(['City', 'Venue'])['diff'].idxmax()]
输出(更新事件 51):
EVENT_START EVENT_END Event Description City Venue diff
50 2023-03-01 2023-04-01 event 51 Chicago Art Institute of Chicago 777 days 02:40:47.022263
10 2015-03-15 2018-02-13 event 11 Corvallis The Beanery -1065 days +02:40:47.022263
17 2020-11-16 2020-11-20 event 18 Lahaina Bamboo Fresh -54 days +02:40:47.022263
14 2020-07-10 2020-09-04 event 15 Long Beach Precious Lamb -131 days +02:40:47.022263
49 2020-05-28 2020-06-02 event 50 Los Angelos Party Haus -225 days +02:40:47.022263
42 2021-07-03 2021-07-17 event 43 Mitchell The Corn Palace 171 days 02:40:47.022263
4 2017-11-20 2017-12-17 event 5 Paris Orsay Museum -1123 days +02:40:47.022263
20 2015-07-31 2015-08-01 event 21 Portland Doug Fir Lounge -1992 days +02:40:47.022263
31 2020-06-03 2020-08-10 event 32 San Diego San Diego Zoo -156 days +02:40:47.022263
23 2019-09-16 2019-09-20 event 24 San Jose Winchester Mystery House -481 days +02:40:47.022263
33 2027-09-01 2027-10-15 event 34 St. Louis Krispy Kreme 2422 days 02:40:47.022263
15 2020-10-18 2020-11-26 event 16 West Columbia New Brookland Tavern -48 days +02:40:47.022263
使用 sort_values
和 drop_duplicates
df['diff'] = np.where(df['EVENT_START'] - pd.Timestamp('now') > pd.Timedelta(days=0),
df['EVENT_START'] - pd.Timestamp('now'),
df['EVENT_END'] - pd.Timestamp('now'))
df.sort_values('diff', ascending=False).drop_duplicates(['City', 'Venue'])
输出:
EVENT_START EVENT_END Event Description City Venue diff
33 2027-09-01 2027-10-15 event 34 St. Louis Krispy Kreme 2422 days 02:40:47.022263
50 2023-03-01 2023-04-01 event 51 Chicago Art Institute of Chicago 777 days 02:40:47.022263
42 2021-07-03 2021-07-17 event 43 Mitchell The Corn Palace 171 days 02:40:47.022263
15 2020-10-18 2020-11-26 event 16 West Columbia New Brookland Tavern -48 days +02:40:47.022263
17 2020-11-16 2020-11-20 event 18 Lahaina Bamboo Fresh -54 days +02:40:47.022263
14 2020-07-10 2020-09-04 event 15 Long Beach Precious Lamb -131 days +02:40:47.022263
31 2020-06-03 2020-08-10 event 32 San Diego San Diego Zoo -156 days +02:40:47.022263
49 2020-05-28 2020-06-02 event 50 Los Angelos Party Haus -225 days +02:40:47.022263
23 2019-09-16 2019-09-20 event 24 San Jose Winchester Mystery House -481 days +02:40:47.022263
10 2015-03-15 2018-02-13 event 11 Corvallis The Beanery -1065 days +02:40:47.022263
4 2017-11-20 2017-12-17 event 5 Paris Orsay Museum -1123 days +02:40:47.022263
20 2015-07-31 2015-08-01 event 21 Portland Doug Fir Lounge -1992 days +02:40:47.022263
# split past events and future events
cond = df['EVENT_START'] > datetime.now()
df_furture = df[cond]
df_past = df[~cond]
# keep the nearest furture
dfn_furture = df_furture.sort_values(['City', 'Venue', 'EVENT_START'])\
.drop_duplicates(['City', 'Venue'], keep='first')
# merge one closest furture event for every city and the past events
dfn = pd.concat([dfn_furture, df_past])
df_result = dfn.sort_values(['City', 'Venue', 'EVENT_START'])\
.drop_duplicates(['City', 'Venue'], keep='last').sort_index()
结果:
EVENT_START EVENT_END Event Description City \
4 2017-11-20 2017-12-17 event 5 Paris
14 2020-07-10 2020-09-04 event 15 Long Beach
15 2020-10-18 2020-11-26 event 16 West Columbia
17 2020-11-16 2020-11-20 event 18 Lahaina
20 2015-07-31 2015-08-01 event 21 Portland
23 2019-09-16 2019-09-20 event 24 San Jose
24 2022-03-01 2022-04-01 event 25 Chicago
31 2020-06-03 2020-08-10 event 32 San Diego
33 2027-09-01 2027-10-15 event 34 St. Louis
34 2017-06-23 2017-06-25 event 35 Corvallis
42 2021-07-03 2021-07-17 event 43 Mitchell
49 2020-05-28 2020-06-02 event 50 Los Angelos
Venue
4 Orsay Museum
14 Precious Lamb
15 New Brookland Tavern
17 Bamboo Fresh
20 Doug Fir Lounge
23 Winchester Mystery House
24 Art Institute of Chicago
31 San Diego Zoo
33 Krispy Kreme
34 The Beanery
42 The Corn Palace
49 Party Haus
我有一长串事件在 pandas 数据框中,每个事件都有事件开始和结束日期。如何使用 pandas.groupby()
在每个城市和场地组合中找到“最近的”活动?
如果没有新的即将发生的事件,最近的事件可能是过去的事件。即在这种情况下数据框中的最新事件,恰好是过去发生的。 如果有多个即将发生的事件,未来最接近的事件将被视为最近的事件。
我试过 groupby.agg("max")
如下,但这总是会给出最远的未来事件:
dfp.groupby(['CITY', 'VENUE'], as_index=False).agg({"EVENT_START" : "max", "EVENT_END": "max"})
正在寻找一种方法来获取时间上最近的未来事件(如果没有未来事件,则获取过去最近的事件)。
示例数据:
EVENT_START,EVENT_END,Event Description,City,Venue
2/5/2016,3/12/2016,event 1,Chicago,Art Institute of Chicago
11/2/2014,12/2/2014,event 2,Los Angelos,Party Haus
1/25/2018,1/31/2018,event 3,Long Beach,Precious Lamb
8/26/2018,8/31/2018,event 4,West Columbia,New Brookland Tavern
11/20/2017,12/17/2017,event 5,Paris,Orsay Museum
6/26/2018,7/9/2018,event 6,Lahaina,Bamboo Fresh
4/3/2010,5/2/2010,event 7,Mitchell,The Corn Palace
9/21/2015,10/18/2015,event 8,San Diego,San Diego Zoo
1/4/2014,1/15/2014,event 9,Portland,Doug Fir Lounge
9/21/2019,9/26/2019,event 10,St. Louis,Krispy Kreme
3/15/2015,2/13/2018,event 11,Corvallis,The Beanery
9/23/2005,10/2/2005,event 12,San Jose,Winchester Mystery House
12/11/2019,12/14/2019,event 13,Chicago,Art Institute of Chicago
6/1/2013,6/26/2013,event 14,Los Angelos,Party Haus
7/10/2020,9/4/2020,event 15,Long Beach,Precious Lamb
10/18/2020,11/26/2020,event 16,West Columbia,New Brookland Tavern
5/14/2004,5/16/2004,event 17,Paris,Orsay Museum
11/16/2020,11/20/2020,event 18,Lahaina,Bamboo Fresh
7/19/2020,10/22/2020,event 19,Mitchell,The Corn Palace
11/1/2017,11/30/2017,event 20,San Diego,San Diego Zoo
7/31/2015,8/1/2015,event 21,Portland,Doug Fir Lounge
10/12/2012,10/20/2012,event 22,St. Louis,Krispy Kreme
2/28/2003,3/13/2003,event 23,Corvallis,The Beanery
9/16/2019,9/20/2019,event 24,San Jose,Winchester Mystery House
3/1/2022,4/1/2022,event 25,Chicago,Art Institute of Chicago
2/19/2009,2/25/2009,event 26,Los Angelos,Party Haus
4/16/2015,5/8/2015,event 27,Long Beach,Precious Lamb
9/7/2016,9/11/2016,event 28,West Columbia,New Brookland Tavern
8/4/2001,8/26/2001,event 29,Paris,Orsay Museum
4/27/2017,6/11/2017,event 30,Lahaina,Bamboo Fresh
5/21/2011,6/19/2011,event 31,Mitchell,The Corn Palace
6/3/2020,8/10/2020,event 32,San Diego,San Diego Zoo
10/29/2012,11/15/2012,event 33,Portland,Doug Fir Lounge
9/1/2027,10/15/2027,event 34,St. Louis,Krispy Kreme
6/23/2017,6/25/2017,event 35,Corvallis,The Beanery
4/25/2007,5/26/2007,event 36,San Jose,Winchester Mystery House
5/30/2003,7/1/2003,event 37,Chicago,Art Institute of Chicago
3/14/2008,4/12/2008,event 38,Los Angelos,Party Haus
5/29/2017,7/27/2017,event 39,Long Beach,Precious Lamb
1/31/2015,3/7/2015,event 40,West Columbia,New Brookland Tavern
4/1/2017,4/21/2017,event 41,Paris,Orsay Museum
12/29/2003,1/31/2004,event 42,Lahaina,Bamboo Fresh
7/3/2021,7/17/2021,event 43,Mitchell,The Corn Palace
9/1/2004,4/30/2005,event 44,San Diego,San Diego Zoo
10/14/2006,10/27/2006,event 45,Portland,Doug Fir Lounge
7/18/2017,7/19/2017,event 46,St. Louis,Krispy Kreme
6/1/2006,6/1/2006,event 47,Corvallis,The Beanery
10/1/2012,11/4/2012,event 48,San Jose,Winchester Mystery House
9/5/2011,9/19/2011,event 49,Chicago,Art Institute of Chicago
5/28/2020,6/2/2020,event 50,Los Angelos,Party Haus
3/1/2023,4/1/2023,event 51,Chicago,Art Institute of Chicago
结果应该是:
3/1/2022 4/1/2022 event 25 Chicago Art Institute of Chicago
5/28/2020 6/2/2020 event 50 Los Angelos Party Haus
7/10/2020 9/4/2020 event 15 Long Beach Precious Lamb
10/18/2020 11/26/2020 event 16 West Columbia New Brookland Tavern
11/20/2017 12/17/2017 event 5 Paris Orsay Museum
11/16/2020 11/20/2020 event 18 Lahaina Bamboo Fresh
7/3/2021 7/17/2021 event 43 Mitchell The Corn Palace
6/3/2020 8/10/2020 event 32 San Diego San Diego Zoo
7/31/2015 8/1/2015 event 21 Portland Doug Fir Lounge
9/1/2027 10/15/2027 event 34 St. Louis Krispy Kreme
6/23/2017 6/25/2017 event 35 Corvallis The Beanery
9/16/2019 9/20/2019 event 24 San Jose Winchester Mystery House
IIUC,
df.loc[df.assign(diff = df['EVENT_START'] - pd.Timestamp('now')).groupby(['City', 'Venue'])['diff'].idxmax()]
输出:
EVENT_START EVENT_END Event Description City Venue
24 2022-03-01 2022-04-01 event 25 Chicago Art Institute of Chicago
34 2017-06-23 2017-06-25 event 35 Corvallis The Beanery
17 2020-11-16 2020-11-20 event 18 Lahaina Bamboo Fresh
14 2020-07-10 2020-09-04 event 15 Long Beach Precious Lamb
49 2020-05-28 2020-06-02 event 50 Los Angelos Party Haus
42 2021-07-03 2021-07-17 event 43 Mitchell The Corn Palace
4 2017-11-20 2017-12-17 event 5 Paris Orsay Museum
20 2015-07-31 2015-08-01 event 21 Portland Doug Fir Lounge
31 2020-06-03 2020-08-10 event 32 San Diego San Diego Zoo
23 2019-09-16 2019-09-20 event 24 San Jose Winchester Mystery House
33 2027-09-01 2027-10-15 event 34 St. Louis Krispy Kreme
15 2020-10-18 2020-11-26 event 16 West Columbia New Brookland Tavern
好吧,如果事件已经过去,您可能想改用 EVENT_END:
df['diff'] = np.where(df['EVENT_START'] - pd.Timestamp('now') > pd.Timedelta(days=0),
df['EVENT_START'] - pd.Timestamp('now'),
df['EVENT_END'] - pd.Timestamp('now'))
df.loc[df.groupby(['City', 'Venue'])['diff'].idxmax()]
输出(更新事件 51):
EVENT_START EVENT_END Event Description City Venue diff
50 2023-03-01 2023-04-01 event 51 Chicago Art Institute of Chicago 777 days 02:40:47.022263
10 2015-03-15 2018-02-13 event 11 Corvallis The Beanery -1065 days +02:40:47.022263
17 2020-11-16 2020-11-20 event 18 Lahaina Bamboo Fresh -54 days +02:40:47.022263
14 2020-07-10 2020-09-04 event 15 Long Beach Precious Lamb -131 days +02:40:47.022263
49 2020-05-28 2020-06-02 event 50 Los Angelos Party Haus -225 days +02:40:47.022263
42 2021-07-03 2021-07-17 event 43 Mitchell The Corn Palace 171 days 02:40:47.022263
4 2017-11-20 2017-12-17 event 5 Paris Orsay Museum -1123 days +02:40:47.022263
20 2015-07-31 2015-08-01 event 21 Portland Doug Fir Lounge -1992 days +02:40:47.022263
31 2020-06-03 2020-08-10 event 32 San Diego San Diego Zoo -156 days +02:40:47.022263
23 2019-09-16 2019-09-20 event 24 San Jose Winchester Mystery House -481 days +02:40:47.022263
33 2027-09-01 2027-10-15 event 34 St. Louis Krispy Kreme 2422 days 02:40:47.022263
15 2020-10-18 2020-11-26 event 16 West Columbia New Brookland Tavern -48 days +02:40:47.022263
使用 sort_values
和 drop_duplicates
df['diff'] = np.where(df['EVENT_START'] - pd.Timestamp('now') > pd.Timedelta(days=0),
df['EVENT_START'] - pd.Timestamp('now'),
df['EVENT_END'] - pd.Timestamp('now'))
df.sort_values('diff', ascending=False).drop_duplicates(['City', 'Venue'])
输出:
EVENT_START EVENT_END Event Description City Venue diff
33 2027-09-01 2027-10-15 event 34 St. Louis Krispy Kreme 2422 days 02:40:47.022263
50 2023-03-01 2023-04-01 event 51 Chicago Art Institute of Chicago 777 days 02:40:47.022263
42 2021-07-03 2021-07-17 event 43 Mitchell The Corn Palace 171 days 02:40:47.022263
15 2020-10-18 2020-11-26 event 16 West Columbia New Brookland Tavern -48 days +02:40:47.022263
17 2020-11-16 2020-11-20 event 18 Lahaina Bamboo Fresh -54 days +02:40:47.022263
14 2020-07-10 2020-09-04 event 15 Long Beach Precious Lamb -131 days +02:40:47.022263
31 2020-06-03 2020-08-10 event 32 San Diego San Diego Zoo -156 days +02:40:47.022263
49 2020-05-28 2020-06-02 event 50 Los Angelos Party Haus -225 days +02:40:47.022263
23 2019-09-16 2019-09-20 event 24 San Jose Winchester Mystery House -481 days +02:40:47.022263
10 2015-03-15 2018-02-13 event 11 Corvallis The Beanery -1065 days +02:40:47.022263
4 2017-11-20 2017-12-17 event 5 Paris Orsay Museum -1123 days +02:40:47.022263
20 2015-07-31 2015-08-01 event 21 Portland Doug Fir Lounge -1992 days +02:40:47.022263
# split past events and future events
cond = df['EVENT_START'] > datetime.now()
df_furture = df[cond]
df_past = df[~cond]
# keep the nearest furture
dfn_furture = df_furture.sort_values(['City', 'Venue', 'EVENT_START'])\
.drop_duplicates(['City', 'Venue'], keep='first')
# merge one closest furture event for every city and the past events
dfn = pd.concat([dfn_furture, df_past])
df_result = dfn.sort_values(['City', 'Venue', 'EVENT_START'])\
.drop_duplicates(['City', 'Venue'], keep='last').sort_index()
结果:
EVENT_START EVENT_END Event Description City \
4 2017-11-20 2017-12-17 event 5 Paris
14 2020-07-10 2020-09-04 event 15 Long Beach
15 2020-10-18 2020-11-26 event 16 West Columbia
17 2020-11-16 2020-11-20 event 18 Lahaina
20 2015-07-31 2015-08-01 event 21 Portland
23 2019-09-16 2019-09-20 event 24 San Jose
24 2022-03-01 2022-04-01 event 25 Chicago
31 2020-06-03 2020-08-10 event 32 San Diego
33 2027-09-01 2027-10-15 event 34 St. Louis
34 2017-06-23 2017-06-25 event 35 Corvallis
42 2021-07-03 2021-07-17 event 43 Mitchell
49 2020-05-28 2020-06-02 event 50 Los Angelos
Venue
4 Orsay Museum
14 Precious Lamb
15 New Brookland Tavern
17 Bamboo Fresh
20 Doug Fir Lounge
23 Winchester Mystery House
24 Art Institute of Chicago
31 San Diego Zoo
33 Krispy Kreme
34 The Beanery
42 The Corn Palace
49 Party Haus