Filtering each X in DataFrame with values from other Series/DataFrame (area under curve)

Question

I'm filtering a DataFrame to get the area under a curve. I've managed to obtain the curve's boundary, so all we need are the rows beneath that curve.

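My approach is to use [1] in the code below to get data_y_border (the red curve in the picture), which works fine. It contains the topmost Y for each X where another column's value is >= 0.7, so I can query data_y_border[x_value] and get the corresponding topmost Y.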

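Note: data_y_border is not the lowest Y value in the whole dataset. data (the blue rectangle in the picture) is our dataset, and data_y_border is the lower boundary of the red region defined by the Density column, where values are >= 0.7: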

    density_zone = data[
        (full_dataset["X"] < x_right_boundary)
        & (full_dataset['Density'] >= 0.7)
        & (full_dataset['Y'] > y_lower_boundary)
    ]

data_y_border is the bottom of the red region. Nothing below it has a Density >= 0.7.

I now want to use the border Y value at each X position to keep every row whose Y is <= the topmost Y for its X (as stored in data_y_border).

In [2] below I tried combining loc with a lambda to compare each row's Y against the topmost Y for that row's X, but I get this error:

ValueError: Can only compare identically-labeled Series objects

Code:

[1] data_y_border = density_zone.groupby("X")["Y"].min() #returns Series

                          or

    data_y_border = density_zone.loc[density_zone.groupby("X")["Y"].idxmin()] # returns DataFrame
    # as per @enke's suggestion

[2] data.loc[lambda row: row['Y'] <= data_y_border.get(row['X'])]

    # get the X value for `row`,
    # use it as the index in `data_y_border` to get the corresponding Y value,
    # compare that row's Y value to see if it's less than or equal to the topmost Y.
    # If it is, keep it
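
For reference, a callable passed to .loc is called once with the whole DataFrame rather than once per row, so the argument named `row` in [2] is really the full frame and `row['X']` is a whole column. The small probe below is only an illustrative sketch: the name `probe`, the list `received`, and the three-row frame are hypothetical and not from the original post.

```python
import pandas as pd

# Hypothetical three-row frame, used only to inspect what .loc passes in.
data = pd.DataFrame({"X": [2.0, 2.0, 4.0], "Y": [307.0, 155.3, 111.2]})

received = []  # records the type of every object handed to the callable

def probe(obj):
    # .loc calls this with the DataFrame itself, not with individual rows.
    received.append(type(obj).__name__)
    # Return a boolean mask so that .loc has something valid to select with.
    return obj["Y"] > 0

result = data.loc[probe]
print(received)
```

Because the callable sees whole columns, any comparison inside it has to be written as a column-wise (vectorized) expression.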

The DataFrame has about 23 columns, but as an example, given the following data DataFrame and data_y_border, I'd expect to keep the rows listed under the expected output:

data = 
X    Y        OtherDataIWantToKeep
2.0  307.0    ...
2.0  155.3    ...     
2.0  120.0    ...     
2.0  80.2     ...        
4.0  500.3    ...
4.0  270.8    ...
4.0  111.2    ...
4.0  78.23    ...
4.0  6.3      ...

data_y_border=
2.0, 155.3
4.0, 111.2

Expected output rows (including all data from the other columns):

X    Y        OtherDataIWantToKeep
2.0  155.3    ...     
2.0  120.0    ...     
2.0  80.2     ...        
4.0  111.2    ...
4.0  78.23    ...
4.0  6.3      ...

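I tried combinations involving .apply, but I got key errors with that approach. I suspect the problem is the data_y_border.get(row['X']) part of the code above, where pandas doesn't like looking up a value in a separate filtered object in order to filter the current DataFrame.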

Is there a way to use loc and a lambda to filter each row of a DataFrame by comparing its value against a value mapped from another DataFrame/Series?

I've considered iterrows (if these were arrays/lists in Python/JS I would just map over them), but that feels too expensive for a fairly large DataFrame.

Answer 1

From your comment:

The curve is based on values from another column. It's basically rows where values for another column are greater than a certain value, find the lowest Y for each X. That becomes our curve boundary. Using that curve we want to find the rows in the area beneath the curve.

It seems data_y_border is computed independently of data, so let's treat it as given (as shown in the question). We can then map it onto data['X'], compare the result with data['Y'], and filter:

out = data[data['Y'] <= data['X'].map(data_y_border.set_index('X')['Y'])]

Output:

     X       Y OtherDataIWantToKeep
1  2.0  155.30                  ...
2  2.0  120.00                  ...
3  2.0   80.20                  ...
6  4.0  111.20                  ...
7  4.0   78.23                  ...
8  4.0    6.30                  ...
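
The map-and-compare approach above can be sketched end to end with the sample data from the question. This is a minimal sketch, not a definitive implementation: the names `border_series`, `border_frame`, `threshold`, `out_from_series` and `out_from_frame` are introduced here for illustration, and the "..." strings stand in for the other columns. It covers both shapes from [1]: a Series indexed by X (the `.min()` variant) can be passed to `map` directly, while a DataFrame with X and Y columns (the `.idxmin()` variant) needs `set_index('X')['Y']` first.

```python
import pandas as pd

# Sample data from the question; "..." stands in for the other columns.
data = pd.DataFrame({
    "X": [2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0],
    "Y": [307.0, 155.3, 120.0, 80.2, 500.3, 270.8, 111.2, 78.23, 6.3],
    "OtherDataIWantToKeep": ["..."] * 9,
})

# Border as a Series indexed by X (the shape returned by groupby(...).min()).
border_series = pd.Series(
    [155.3, 111.2], index=pd.Index([2.0, 4.0], name="X"), name="Y"
)

# Border as a DataFrame with X and Y columns (the shape of the idxmin variant).
border_frame = border_series.reset_index()

# Look up the border Y for every row's X, then compare column-wise.
threshold = data["X"].map(border_series)
out_from_series = data[data["Y"] <= threshold]

# Same filter starting from the DataFrame form, as in the answer's one-liner.
out_from_frame = data[data["Y"] <= data["X"].map(border_frame.set_index("X")["Y"])]

print(out_from_series)
```

Any X that has no entry in the border maps to NaN, and a comparison against NaN is False, so those rows are dropped rather than raising an error.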

Answer 2

Couldn't you create data_y_mins_index and data_y_mins_values from data_y_mins in the same DataFrame as Y and X? Then you could filter like this:

data[data['Y']<=data['y_min_value']]
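
One possible way to get such a y_min_value column is to join the border onto the data by X. The sketch below assumes the border is a Series indexed by X, as produced by the `.min()` variant in the question; `with_border` and `filtered` are hypothetical names, and using `DataFrame.join` is my choice here, not something this answer specifies.

```python
import pandas as pd

# Sample data from the question; "..." stands in for the other columns.
data = pd.DataFrame({
    "X": [2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0],
    "Y": [307.0, 155.3, 120.0, 80.2, 500.3, 270.8, 111.2, 78.23, 6.3],
    "OtherDataIWantToKeep": ["..."] * 9,
})

# Border Series indexed by X, as given in the question.
data_y_border = pd.Series([155.3, 111.2], index=pd.Index([2.0, 4.0], name="X"))

# Attach the border Y to every row by matching data["X"] against the
# border's index; the left join keeps all rows of data in their order.
with_border = data.join(data_y_border.rename("y_min_value"), on="X")

# Filter exactly as in the answer, then drop the helper column.
filtered = with_border[with_border["Y"] <= with_border["y_min_value"]]
filtered = filtered.drop(columns="y_min_value")

print(filtered)
```

Keeping the helper column around can be handy for inspecting which threshold each row was compared against; drop it only once the filter looks right.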

python

lambda

dataframe

pandas

pandas-groupby