Selecting first row from each subgroup (pandas)

How can I select the subset of rows with the smallest distance, grouped by the date and p columns?

df
    v       p       distance    date
0   14.6    sst     22454.1     2021-12-30
1   14.9    sst     24454.1     2021-12-30
2   14.8    sst     33687.4     2021-12-30
3   1.67    wvht    23141.8     2021-12-30
4   1.9     wvht    24454.1     2021-12-30
5   1.8     wvht    24454.1     2021-12-30
6   1.7     wvht    23141.4     2021-12-31
7   2.1     wvht    24454.1     2021-12-31

Ideally, the returned DataFrame should contain:

df
    v       p       distance    date
0   14.6    sst     22454.1     2021-12-30
3   1.67    wvht    23141.8     2021-12-30
6   1.7     wvht    23141.4     2021-12-31

One approach is to use groupby + idxmin to get the index label of the minimum distance in each group, then use loc to select those rows:

out = df.loc[df.groupby(['date', 'p'])['distance'].idxmin()]

Output:

       v     p  distance        date
0  14.60   sst   22454.1  2021-12-30
3   1.67  wvht   23141.8  2021-12-30
6   1.70  wvht   23141.4  2021-12-31
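
To make the two steps easier to follow, here is a minimal self-contained sketch that splits the one-liner apart. The DataFrame is rebuilt by hand from the question's data, and the intermediate name idx is only illustrative.

```python
import pandas as pd

# The question's data, rebuilt by hand (default integer index 0..7).
df = pd.DataFrame({
    "v": [14.6, 14.9, 14.8, 1.67, 1.9, 1.8, 1.7, 2.1],
    "p": ["sst", "sst", "sst", "wvht", "wvht", "wvht", "wvht", "wvht"],
    "distance": [22454.1, 24454.1, 33687.4, 23141.8,
                 24454.1, 24454.1, 23141.4, 24454.1],
    "date": ["2021-12-30"] * 6 + ["2021-12-31"] * 2,
})

# Step 1: for each (date, p) group, find the index label of the row
# with the smallest distance. The result is a Series of labels.
idx = df.groupby(["date", "p"])["distance"].idxmin()

# Step 2: use those labels to pull the complete rows from the original frame.
out = df.loc[idx]
print(out)
```

Because the rows are selected by label, the original index (0, 3, 6) is kept in the result.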

Sort the values by p and distance, then drop all duplicates, keeping the first occurrence for each combination of p and date:
df.sort_values(by=['p', 'distance']).drop_duplicates(subset=['p', 'date'], keep='first')

Output:

       v     p  distance        date
0  14.60   sst   22454.1  2021-12-30
6   1.70  wvht   23141.4  2021-12-31
3   1.67  wvht   23141.8  2021-12-30
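
Note that the rows above come back in sorted order (0, 6, 3) rather than in their original order. If the original order matters, one option is to append sort_index(), as in this sketch. It assumes the same data as the question, rebuilt by hand, and the names dedup and restored are only illustrative.

```python
import pandas as pd

# The question's data, rebuilt by hand (default integer index 0..7).
df = pd.DataFrame({
    "v": [14.6, 14.9, 14.8, 1.67, 1.9, 1.8, 1.7, 2.1],
    "p": ["sst", "sst", "sst", "wvht", "wvht", "wvht", "wvht", "wvht"],
    "distance": [22454.1, 24454.1, 33687.4, 23141.8,
                 24454.1, 24454.1, 23141.4, 24454.1],
    "date": ["2021-12-30"] * 6 + ["2021-12-31"] * 2,
})

# Sort so the smallest distance comes first within each p, then keep
# only the first row seen for every (p, date) pair.
dedup = (
    df.sort_values(by=["p", "distance"])
      .drop_duplicates(subset=["p", "date"], keep="first")
)

# Restore the original row order by sorting on the preserved index labels.
restored = dedup.sort_index()
print(restored)
```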

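If you don't need the original index, you can use .first() on the groups and then call reset_index(). This works here only because the smallest distance already comes first in each group; .first() takes the first row and does not look at distance itself. (.min() is not a substitute, since it takes the minimum of every column separately.)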

df.groupby(['date', 'p']).first().reset_index()

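import io

import pandas as pd

# The unnamed first column of the text becomes the index.
text = '''v       p       distance    date
0   14.6    sst     22454.1     2021-12-30
1   14.9    sst     24454.1     2021-12-30
2   14.8    sst     33687.4     2021-12-30
3   1.67    wvht    23141.8     2021-12-30
4   1.9     wvht    24454.1     2021-12-30
5   1.8     wvht    24454.1     2021-12-30
6   1.7     wvht    23141.4     2021-12-31
7   2.1     wvht    24454.1     2021-12-31'''

# Raw string so that \s reaches the regex engine unchanged.
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

# First row of each (date, p) group; reset_index() turns the group keys back into columns.
df.groupby(['date', 'p']).first().reset_index()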

Result:

         date     p      v  distance
0  2021-12-30   sst  14.60   22454.1
1  2021-12-30  wvht   1.67   23141.8
2  2021-12-31  wvht   1.70   23141.4
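
Since .first() depends on row order, a more defensive variant sorts by distance before grouping. The sketch below shows the difference on a deliberately reversed copy of the data. The names shuffled, naive and robust are illustrative, and the DataFrame is rebuilt by hand from the question.

```python
import pandas as pd

# The question's data, rebuilt by hand (default integer index 0..7).
df = pd.DataFrame({
    "v": [14.6, 14.9, 14.8, 1.67, 1.9, 1.8, 1.7, 2.1],
    "p": ["sst", "sst", "sst", "wvht", "wvht", "wvht", "wvht", "wvht"],
    "distance": [22454.1, 24454.1, 33687.4, 23141.8,
                 24454.1, 24454.1, 23141.4, 24454.1],
    "date": ["2021-12-30"] * 6 + ["2021-12-31"] * 2,
})

# Reverse the rows so the minimum-distance row is no longer first in its group.
shuffled = df.iloc[::-1]

# Without sorting, .first() just returns whichever row happens to come first.
naive = shuffled.groupby(["date", "p"]).first().reset_index()

# Sorting by distance first makes "first row in the group" mean "smallest distance".
robust = (
    shuffled.sort_values("distance")
            .groupby(["date", "p"])
            .first()
            .reset_index()
)
print(robust)
```

On the reversed data, naive picks the wrong rows while robust returns the same three rows as the approaches above.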