Filter table on aggregate value and multi-index

I have the following dataset:

df.head(7)
    Origin Destination        Date  Quantity
0  Atlanta          LA  2021-09-09         1
1  Atlanta          LA  2021-09-11         4
2  Atlanta     Chicago  2021-09-16         1
3  Atlanta     Seattle  2021-09-27        12
4  Seattle          LA  2021-09-29         2
5  Seattle     Atlanta  2021-09-13         2
6  Seattle      Newark  2021-09-17         7

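This table shows the number of items (Quantity) shipped from a given origin to a given destination on a given date. It contains one month of data. The table is read in like this: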

shipments = pd.read_csv('shipments.csv', parse_dates=['Date'])

Using the shipments data, I can create a new aggregate table showing the total quantity shipped between each Origin and Destination pair for the month:

shipments_agg = shipments.groupby(['Origin', 'Destination']).sum(numeric_only=True)

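As a final step, I would like to create a new table from the shipments table in which a row (Origin, Destination, Date, Quantity) is included only if the total quantity for its (Origin, Destination) pair is greater than 50. In other words, a row (Origin, Destination, Date, Quantity) should be included only if the matching (Origin, Destination) entry in shipments_agg has a Quantity greater than 50. I'm not quite sure how to accomplish this.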

You can use the index of the aggregated dataframe to locate values in the original dataframe.

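There is probably a way to do all of this in a mega-one-liner, which I'm not a fan of because of readability/troubleshooting concerns, but here is a way that breaks the steps down: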

In [67]: shipments = pd.read_clipboard()

In [68]: shipments
Out[68]: 
    Origin Destination        Date  Quantity
0  Atlanta          LA  2021-09-09         1
1  Atlanta          LA  2021-09-11         4
2  Atlanta     Chicago  2021-09-16         1
3  Atlanta     Seattle  2021-09-27        12
4  Seattle          LA  2021-09-29         2
5  Seattle     Atlanta  2021-09-13         2
6  Seattle      Newark  2021-09-17         7

In [69]: shipments_agg = shipments.groupby(["Origin", "Destination"]).sum(numeric_only=True)

In [70]: shipments_agg
Out[70]: 
                     Quantity
Origin  Destination          
Atlanta Chicago             1
        LA                  5
        Seattle            12
Seattle Atlanta             2
        LA                  2
        Newark              7

In [71]: # let's use a cutoff of 4 (the question asks for 50, but the sample totals are too small for that)

In [72]: hi_qty_shipments = shipments_agg[shipments_agg["Quantity"] > 4]

In [73]: hi_qty_shipments
Out[73]: 
                     Quantity
Origin  Destination          
Atlanta LA                  5
        Seattle            12
Seattle Newark              7

In [74]: # now re-index the base dataframe and use this multi-index to retrieve what is desired

In [75]: shipments.set_index(["Origin", "Destination"], inplace=True)

In [76]: shipments.loc[hi_qty_shipments.index]
Out[76]: 
                           Date  Quantity
Origin  Destination                      
Atlanta LA           2021-09-09         1
        LA           2021-09-11         4
        Seattle      2021-09-27        12
Seattle Newark       2021-09-17         7
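
For comparison, here is a minimal sketch of a more compact variant that leaves the original columns and row index untouched. It is not the approach used in the steps above: it swaps in `groupby(...).transform("sum")` to broadcast each pair's total back onto its rows and then filters with a boolean mask. The sample rows are copied from the question, and the names `cutoff`, `pair_total`, and `hi_qty_rows` are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question
shipments = pd.DataFrame(
    {
        "Origin": ["Atlanta", "Atlanta", "Atlanta", "Atlanta",
                   "Seattle", "Seattle", "Seattle"],
        "Destination": ["LA", "LA", "Chicago", "Seattle",
                        "LA", "Atlanta", "Newark"],
        "Date": pd.to_datetime(
            ["2021-09-09", "2021-09-11", "2021-09-16", "2021-09-27",
             "2021-09-29", "2021-09-13", "2021-09-17"]
        ),
        "Quantity": [1, 4, 1, 12, 2, 2, 7],
    }
)

cutoff = 4  # illustrative cutoff for the small sample; the question asks for 50

# transform("sum") returns a Series aligned with the original rows,
# holding the total Quantity of each row's (Origin, Destination) pair
pair_total = shipments.groupby(["Origin", "Destination"])["Quantity"].transform("sum")

# Keep only the rows whose pair total exceeds the cutoff
hi_qty_rows = shipments[pair_total > cutoff]
print(hi_qty_rows)
```

With the sample data this should keep the same four shipments as the index-based lookup, but with Origin and Destination still as ordinary columns. If you prefer the index-based version, note that `set_index(..., inplace=True)` modifies `shipments` itself, so you may want to work on a copy.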