Is there a way to group by in pandas, but only when a group has a specific number of consecutive values?

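I'm having some trouble with my DataFrame, shown below. I'm trying to group it so that one column is joined with "-" and the others are simply joined with \n. The problem is that I only want to keep runs that contain a minimum number of consecutive numbers (at least 4).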

   a      b  c
0  a  Num_1  0
1  a  Num_1  1
2  a  Num_1  2
3  a  Num_2  5
4  a  Num_2  6
5  a  Num_2  7
6  a  Num_2  8
7  a  Num_2  9

I wrote the following code:

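def split_by_threshold(li):
    # indices where a new run starts (the step from the previous value is not 1)
    inds = [0]+[ind for ind,(i,j) in enumerate(zip(li,li[1:]),1) if j-i != 1]+[len(li)]
    # slice the list into runs of consecutive values
    rez = [li[i:j] for i,j in zip(inds,inds[1:])]
    return rez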

def dropst(serie):
    serie = serie.to_numpy().tolist()
    serie = list(dict.fromkeys(serie))
    return '\n'.join(serie)

def joining_(series):
    series = series.to_numpy().tolist()
    if series:
        split_li = split_by_threshold(series)
        a=[]
        for x in split_li:
            if x[-1]-x[0]:
                a.append(str(x[0])+'-'+str(x[-1]))
        return '\n'.join(a)
    else:
        return 'None'

col_1, col_2, col_3 = d.columns
final = d.groupby([col_1], as_index = False).agg(
    {   col_1: 'first',
        col_2: dropst,
        col_3: joining_}
)

print(final)

The result I get is:

   a             b         c
0  a  Num_1\nNum_2  0-2\n5-9

What I actually need is:

   a   b      c
0  a   Num_2  5-9

IIUC, you can group by a, b, and an additional group key that identifies runs of consecutive values, then aggregate with a custom function:

def join(s, thresh=4):
    # s holds one run of consecutive values, so its length is MAX - MIN + 1
    MIN = s.min()
    MAX = s.max()
    return f'{MIN}-{MAX}' if MAX - MIN + 1 >= thresh else float('nan')

consecutive = df['c'].diff().ne(1).cumsum()
# could also be
# df.groupby(['a','b'])['c'].diff().ne(1).cumsum()
# but not required as we anyway group by those later

(df
 .groupby(['a', 'b', consecutive], as_index=False)
 ['c']
 .agg(join, thresh=4)
 .dropna(subset=['c'])
 )

Output:

   a      b    c
2  a  Num_2  5-9
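
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the frame from the question and filters on the number of rows in each run rather than on NaN values. The names `summarize_runs`, `min_len`, `run`, `start`, `stop` and `n` are hypothetical, chosen for illustration only; the run key is renamed to `run` so it cannot clash with column `c`.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    "a": ["a"] * 8,
    "b": ["Num_1"] * 3 + ["Num_2"] * 5,
    "c": [0, 1, 2, 5, 6, 7, 8, 9],
})

# A new run starts wherever the step from the previous row is not exactly 1.
# diff()   -> NaN, 1, 1, 3, 1, 1, 1, 1
# ne(1)    -> True, False, False, True, False, False, False, False
# cumsum() -> 1, 1, 1, 2, 2, 2, 2, 2
run = df["c"].diff().ne(1).cumsum().rename("run")


def summarize_runs(frame: pd.DataFrame, min_len: int = 4) -> pd.DataFrame:
    """Return one 'start-stop' row per run of at least min_len consecutive values."""
    run_id = frame["c"].diff().ne(1).cumsum().rename("run")
    stats = (
        frame.groupby(["a", "b", run_id])["c"]
        .agg(start="min", stop="max", n="count")  # n = rows in the run
        .reset_index()
    )
    kept = stats[stats["n"] >= min_len]
    label = kept["start"].astype(str) + "-" + kept["stop"].astype(str)
    return kept.assign(c=label)[["a", "b", "c"]].reset_index(drop=True)


result = summarize_runs(df, min_len=4)
print(result)
```

With the data above, only the Num_2 run (five values, 5 to 9) passes `min_len=4`; the Num_1 run has three values and is dropped. Counting rows makes the "at least N numbers" condition explicit, which is a design choice of this sketch and not part of the code above.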