Is there a way to group by in pandas, but only keep a group if it has a minimum number of consecutive values?
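I'm having some trouble with my DataFrame. I have the DF shown below. I'm trying to group it so that each run of consecutive values becomes one line with the first and last value joined by "-", and the lines are simply separated by \n. The problem is that I only want runs with a certain number of consecutive values (at least 4).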
   a      b  c
0  a  Num_1  0
1  a  Num_1  1
2  a  Num_1  2
3  a  Num_2  5
4  a  Num_2  6
5  a  Num_2  7
6  a  Num_2  8
7  a  Num_2  9
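For anyone who wants to reproduce this, a frame like the one above can be built roughly as follows. The variable name `d` is the one used in the question's code; the construction itself is an assumption about how the data was created, not something stated in the post.

```python
import pandas as pd

# Assumed reconstruction of the sample frame shown in the question.
d = pd.DataFrame({
    "a": ["a"] * 8,
    "b": ["Num_1"] * 3 + ["Num_2"] * 5,
    "c": [0, 1, 2, 5, 6, 7, 8, 9],
})
```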
I wrote the following code:
def split_by_threshold(li):
    inds = [0]+[ind for ind,(i,j) in enumerate(zip(li,li[1:]),1) if j-i != 1]+[len(li)+1]
    rez = [li[i:j] for i,j in zip(inds,inds[1:])]
    return rez

def dropst(serie):
    serie = serie.to_numpy().tolist()
    serie = list(dict.fromkeys(serie))
    return '\n'.join(serie)

def joining_(series):
    series = series.to_numpy().tolist()
    if series:
        split_li = split_by_threshold(series)
        a=[]
        for x in split_li:
            if x[-1]-x[0]:
                a.append(str(x[0])+'-'+str(x[-1]))
        return '\n'.join(a)
    else:
        return 'None'

col_1, col_2, col_3 = d.columns
final = d.groupby([col_1], as_index = False).agg(
    { col_1: 'first',
      col_2: dropst,
      col_3: joining_}
)
print(final)
The result I get is:
   a             b         c
0  a  Num_1\nNum_2  0-2\n5-9
What I actually need is:
   a      b    c
0  a  Num_2  5-9
IIUC, you can groupby a, b, and an additional new group that identifies consecutive values. Then agg with a custom function:
def join(s, thresh=4):
    MIN = s.min()
    MAX = s.max()
    # note: within a run of consecutive integers, MAX-MIN >= thresh
    # means the run holds at least thresh+1 values
    return f'{MIN}-{MAX}' if MAX-MIN >= thresh else float('nan')

# a new run id starts wherever the step from the previous value is not 1
consecutive = df['c'].diff().ne(1).cumsum()
# could also be
# df.groupby(['a','b'])['c'].diff().ne(1).cumsum()
# but not required as we anyway group by those later

(df
 .groupby(['a', 'b', consecutive], as_index=False)
 ['c']
 .agg(join, thresh=4)
 .dropna(subset='c')
)
Output:
   a      b    c
2  a  Num_2  5-9
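One detail worth checking: `join` compares `MAX-MIN` with the threshold, so a run needs five values to pass `thresh=4`. If the requirement is literally "at least 4 values in a row", counting the members of each run is one possible alternative. Below is a minimal sketch under that assumption, using the sample data from the question. The helper `runs_with_min_length`, its `min_len` parameter, and the intermediate column names `run`, `lo`, `hi` and `n` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data as shown in the question.
df = pd.DataFrame({
    "a": ["a"] * 8,
    "b": ["Num_1"] * 3 + ["Num_2"] * 5,
    "c": [0, 1, 2, 5, 6, 7, 8, 9],
})

def runs_with_min_length(frame, min_len=4):
    """Hypothetical helper: keep runs of consecutive `c` values
    that contain at least `min_len` members."""
    # A new run starts wherever the step from the previous value
    # (within the same a/b group) is not exactly 1.
    run_id = frame.groupby(["a", "b"])["c"].diff().ne(1).cumsum().rename("run")
    stats = (
        frame.groupby(["a", "b", run_id])["c"]
        .agg(lo="min", hi="max", n="count")  # named aggregation
        .reset_index()
    )
    # Filter on the number of values in the run, then build the "lo-hi" label.
    kept = stats[stats["n"] >= min_len].assign(
        c=lambda t: t["lo"].astype(str) + "-" + t["hi"].astype(str)
    )
    return kept[["a", "b", "c"]].reset_index(drop=True)

result = runs_with_min_length(df)
print(result)
```

With the sample data this keeps only the Num_2 run (5-9), because the Num_1 run has just three values. Calling `runs_with_min_length(df, min_len=3)` would keep both runs.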