Filter Pandas DataFrame using a loop that is applying a dynamic conditional

Question

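I have a dataframe with two columns. One column shows distance, and the other contains unique 'trackIds' that are each associated with a set of distances.

Example: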

    trackId.      distance
    
      2.           17.452
      2.            8.650
      2.           10.392
      2.           11.667
      2.           23.551
      2.            9.881
      3.            6.052
      3.            7.241
      3.            8.459
      3.           22.644
      3.          126.890
      3.           12.442
      3.            5.891
      4.           44.781
      4.            7.657
      4.           36.781
      4.          224.001

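What I want to do is eliminate any trackIds that contain a large spike in distance, that is, a spike greater than 75. In this example, track IDs 3 and 4 (and all of their associated distances) would be removed from the dataframe because we see spikes in distance greater than 75, leaving a dataframe that contains only track ID 2 and all of its associated distance values.

Here is my code: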

    i = 0
    k = 1
    length = len(dataframe)
    while i < length: 
        if (dataframe.distance[k] - dataframe.distance[i]) > 75: 
        bad_id = dataframe.trackId[k]
        condition = dataframe.trackid != bad_id
        df2 = dataframe[condition]
    i+=1

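I tried using a while loop that goes through all of the different trackIds, subtracts the distance values, and checks whether the result is > 75. If it is, the program assigns that trackId to the variable 'bad_id' and uses it as a condition to filter the dataframe so that it contains only trackIds that are not equal to the bad_id(s).

I keep getting NameErrors because I'm unsure how to structure the loop properly, and in general I'm not sure whether this approach works at all.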

Answer 1

We can use diff to compute the difference between each row and the one before it, then use groupby transform to check whether any difference in the group is gt 75. Then keep the groups where there are no matches:

# m is True for every row whose trackId has no row-to-row increase above 75
m = ~(df['distance'].diff().gt(75).groupby(df['trackId']).transform('any'))
filtered_df = df.loc[m, df.columns]

filtered_df:

    trackId  distance
0       2.0    17.452
1       2.0     8.650
2       2.0    10.392
3       2.0    11.667
4       2.0    23.551
5       2.0     9.881
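
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data from the question is loaded into a DataFrame named df; the name has_spike is only an illustrative choice, not something from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'trackId': [2.0] * 6 + [3.0] * 7 + [4.0] * 4,
    'distance': [17.452, 8.650, 10.392, 11.667, 23.551, 9.881,
                 6.052, 7.241, 8.459, 22.644, 126.890, 12.442, 5.891,
                 44.781, 7.657, 36.781, 224.001],
})

# Flag rows that are more than 75 above the previous row, then spread
# that flag to every row that shares the same trackId
has_spike = df['distance'].diff().gt(75).groupby(df['trackId']).transform('any')

# Keep only the rows whose trackId never produced a flagged row
filtered_df = df.loc[~has_spike]
print(filtered_df)
```

No explicit loop is needed here: the boolean mask is built for the whole column at once and then used to index the DataFrame.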

Breakdown of steps as a DataFrame:

breakdown = pd.DataFrame({'diff': df['distance'].diff()})
breakdown['gt 75'] = breakdown['diff'].gt(75)
breakdown['groupby any'] = (
    breakdown['gt 75'].groupby(df['trackId']).transform('any')
)
breakdown['negation'] = ~breakdown['groupby any']
print(breakdown)

breakdown:

       diff  gt 75  groupby any  negation
0       NaN  False        False      True
1    -8.802  False        False      True
2     1.742  False        False      True
3     1.275  False        False      True
4    11.884  False        False      True
5   -13.670  False        False      True
6    -3.829  False         True     False
7     1.189  False         True     False
8     1.218  False         True     False
9    14.185  False         True     False
10  104.246   True         True     False  # Spike of more than 75
11 -114.448  False         True     False
12   -6.551  False         True     False
13   38.890  False         True     False
14  -37.124  False         True     False
15   29.124  False         True     False
16  187.220   True         True     False  # Spike of more than 75
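
One thing worth noting in the breakdown: diff is computed over the whole column, so the first row of each track (rows 6 and 13 here) is compared with the last row of the previous track. That does no harm with this sample, but a track could be dropped only because it starts far above where the previous one ended. A possible variant, sketched below, takes the diff within each trackId instead. The function name drop_tracks_with_spikes, the threshold parameter, and the demo data are all hypothetical and only there for illustration.

```python
import pandas as pd

def drop_tracks_with_spikes(df, threshold=75):
    # diff() within each trackId, so the first row of a track is NaN
    # instead of being compared with the previous track's last row
    step = df.groupby('trackId')['distance'].diff()
    has_spike = step.gt(threshold).groupby(df['trackId']).transform('any')
    return df.loc[~has_spike]

# Hypothetical data: track 2 starts 90 above the end of track 1,
# but neither track has a spike of its own
demo = pd.DataFrame({
    'trackId': [1, 1, 1, 2, 2, 2],
    'distance': [5.0, 6.0, 7.0, 97.0, 98.0, 99.0],
})
print(drop_tracks_with_spikes(demo))
```

With the whole-column diff, track 2 in this demo data would be removed because of the jump at the boundary between the tracks; the per-group diff keeps both. Also, gt only catches upward jumps, so if a sudden drop should count as a spike too, .abs() can be applied to the differences before the comparison.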
