Filter table on aggregate value and multi-index
I have the following dataset:
df.head(7)
Origin Destination Date Quantity
0 Atlanta LA 2021-09-09 1
1 Atlanta LA 2021-09-11 4
2 Atlanta Chicago 2021-09-16 1
3 Atlanta Seattle 2021-09-27 12
4 Seattle LA 2021-09-29 2
5 Seattle Atlanta 2021-09-13 2
6 Seattle Newark 2021-09-17 7
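This table represents the number of items (Quantity) shipped from a given origin to a given destination on a given date. The table contains one month of data. It is read in like this: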
shipments = pd.read_csv('shipments.csv', parse_dates=['Date'])
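Using the shipment data, I can create a new aggregated table showing the total quantity shipped between each Origin and Destination pair for the month:
shipments_agg = shipments.groupby(['Origin', 'Destination'])[['Quantity']].sum()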
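As a final step, I would like to create a new table from the shipments table in which a row (Origin, Destination, Date, Quantity) is included only if the total quantity for its (Origin, Destination) pair is greater than 50. In other words, a row (Origin, Destination, Date, Quantity) should be included only if the Quantity for that (Origin, Destination) pair in shipments_agg is greater than 50. I'm not quite sure how to accomplish this.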
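You can use the index from the aggregated dataframe to locate values in the original dataframe.
There is probably a way to do all of this in a mega one-liner, which I'm not a fan of because of readability and troubleshooting concerns, but here is a way that breaks the steps down: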
In [67]: shipments = pd.read_clipboard()
In [68]: shipments
Out[68]:
Origin Destination Date Quantity
0 Atlanta LA 2021-09-09 1
1 Atlanta LA 2021-09-11 4
2 Atlanta Chicago 2021-09-16 1
3 Atlanta Seattle 2021-09-27 12
4 Seattle LA 2021-09-29 2
5 Seattle Atlanta 2021-09-13 2
6 Seattle Newark 2021-09-17 7
In [69]: shipments_agg = shipments.groupby(["Origin", "Destination"])[["Quantity"]].sum()
In [70]: shipments_agg
Out[70]:
Quantity
Origin Destination
Atlanta Chicago 1
LA 5
Seattle 12
Seattle Atlanta 2
LA 2
Newark 7
In [71]: # let's use a cutoff of 4
In [72]: hi_qty_shipments = shipments_agg[shipments_agg["Quantity"] > 4]
In [73]: hi_qty_shipments
Out[73]:
Quantity
Origin Destination
Atlanta LA 5
Seattle 12
Seattle Newark 7
In [74]: # now re-index the base dataframe and use this multi-index to retrieve what is desired
In [75]: shipments.set_index(["Origin", "Destination"], inplace=True)
In [76]: shipments.loc[hi_qty_shipments.index]
Out[76]:
Date Quantity
Origin Destination
Atlanta LA 2021-09-09 1
LA 2021-09-11 4
Seattle 2021-09-27 12
Seattle Newark 2021-09-17 7
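For comparison, here is a rough sketch of two alternatives that leave the original dataframe's index untouched (the session above calls set_index with inplace=True, which changes shipments itself). This is not the method used above, just one possible variation: the first option uses groupby(...).transform("sum") to put each pair's total on every row, and the second tests each row's (Origin, Destination) pair against the filtered multi-index with isin. The dataframe is rebuilt from the seven sample rows in the question, and the names cutoff, pair_total, by_transform, hi_qty_pairs, row_pairs and by_isin are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question; real data would come from shipments.csv.
shipments = pd.DataFrame(
    {
        "Origin": ["Atlanta", "Atlanta", "Atlanta", "Atlanta",
                   "Seattle", "Seattle", "Seattle"],
        "Destination": ["LA", "LA", "Chicago", "Seattle",
                        "LA", "Atlanta", "Newark"],
        "Date": pd.to_datetime(
            ["2021-09-09", "2021-09-11", "2021-09-16", "2021-09-27",
             "2021-09-29", "2021-09-13", "2021-09-17"]
        ),
        "Quantity": [1, 4, 1, 12, 2, 2, 7],
    }
)

cutoff = 4  # same illustrative cutoff as the session above; the question asks for 50

# Option 1: broadcast each pair's total back onto its rows, then filter.
pair_total = shipments.groupby(["Origin", "Destination"])["Quantity"].transform("sum")
by_transform = shipments[pair_total > cutoff]

# Option 2: aggregate, keep the qualifying pairs, then test row membership.
shipments_agg = shipments.groupby(["Origin", "Destination"])[["Quantity"]].sum()
hi_qty_pairs = shipments_agg[shipments_agg["Quantity"] > cutoff].index
row_pairs = pd.MultiIndex.from_frame(shipments[["Origin", "Destination"]])
by_isin = shipments[row_pairs.isin(hi_qty_pairs)]

print(by_transform)
```

Both options keep Origin and Destination as ordinary columns and preserve the original row order, which may be convenient if the filtered table is written back out or joined to something else.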