Pandas: group all aggregates smaller than x
I am trying to find a more advanced way to group by aggregates in pandas. For example:
import pandas as pd

d = {'name': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e'], 'amount': [2, 5, 2, 3, 7, 2, 4, 1]}
df = pd.DataFrame(data=d)
df_per_category = df.groupby(['name']) \
    .agg({'amount': ['count', 'sum']}) \
    .sort_values(by=[('amount', 'count')], ascending=False)
df_per_category[('amount', 'sum')].plot.barh()
df_per_category
This produces:
name | amount count | amount sum |
---|---|---|
b | 3 | 12 |
a | 2 | 7 |
c | 1 | 2 |
d | 1 | 4 |
e | 1 | 1 |
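If you have a dataset where 70% of the items have a count of only one and 30% have a count greater than one, it would be nicer to group that 70% together. To keep it simple at first, just take all records whose name occurs only once and put them under a name such as other. The result would then look like: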
name | amount count | amount sum |
---|---|---|
b | 3 | 12 |
a | 2 | 7 |
other | 3 | 7 |
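Is there a pandas way to do this? Right now I am thinking of looping over my aggregated result and building a new DataFrame by hand.
Current solution: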
name = []
count = []
amount = []
# Upper bound of each bucket -> [total count, total amount]
aggregates = {
    5: [0, 0],
    10: [0, 0],
    25: [0, 0],
    50: [0, 0],
}
l = list(aggregates)
first_aggregates = l
last_aggregate = l[-1] + 1
aggregates.update({last_aggregate: [0, 0]})

def aggregate_small_values(c):
    n = c.name
    s = c[('amount', 'sum')]
    c = c[('amount', 'count')]
    if c <= 2:
        # Small group: add it to the first bucket whose bound covers its sum
        if s < last_aggregate:
            for a in first_aggregates:
                if s <= a:
                    aggregates[a][0] += c
                    aggregates[a][1] += s
                    break
        else:
            aggregates[last_aggregate][0] += c
            aggregates[last_aggregate][1] += s
    else:
        # Large group: keep it under its own name
        name.append(n)
        count.append(c)
        amount.append(s)

df_per_category.apply(aggregate_small_values, axis=1)
for a in first_aggregates:
    name.append(f'{a} and smaller')
    count.append(aggregates[a][0])
    amount.append(aggregates[a][1])
name.append(f'{last_aggregate} and bigger')
count.append(aggregates[last_aggregate][0])
amount.append(aggregates[last_aggregate][1])
df_agg = pd.DataFrame(index=name, data={'count': count, 'amount': amount})
df_agg.plot.barh(title='Boodschappen 2021')
df_agg
This produces a horizontal bar chart of the bucketed totals (image not preserved).
If you need to replace name with other where the count is 1, use Series.duplicated with keep=False:
df.loc[~df['name'].duplicated(keep=False), 'name'] = 'other'
print(df)
    name  amount
0      a       2
1      a       5
2      b       2
3      b       3
4      b       7
5  other       2
6  other       4
7  other       1
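Once the rare names are relabelled, the summary table from the question follows from an ordinary groupby. The snippet below is a minimal sketch of that step, not part of the original answer. It works on a copy so that `df` keeps its original names; the variable names `relabelled` and `summary` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e'],
    'amount': [2, 5, 2, 3, 7, 2, 4, 1],
})

# Work on a copy so the original names stay available.
relabelled = df.copy()

# duplicated(keep=False) is True for every name that occurs more than once,
# so its negation selects the names that occur exactly once.
singletons = ~relabelled['name'].duplicated(keep=False)
relabelled.loc[singletons, 'name'] = 'other'

# Aggregate as in the question, now with the singletons pooled under 'other'.
summary = relabelled.groupby('name')['amount'].agg(['count', 'sum'])
print(summary)
```

With the sample data, `b` keeps count 3 and sum 12, `a` keeps count 2 and sum 7, and `other` collects `c`, `d` and `e` with count 3 and sum 7.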
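If you need to replace by percentage instead, here setting every name that makes up less than 20% of the rows to other, start again from the original df, use Series.value_counts with normalize=True, and then use Series.map to build a mask with the same size as the original df: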
s = df['name'].value_counts(normalize=True)
print(s)
b    0.375
a    0.250
d    0.125
e    0.125
c    0.125
Name: name, dtype: float64
df.loc[df['name'].map(s).lt(0.2), 'name'] = 'other'
print(df)
    name  amount
0      a       2
1      a       5
2      b       2
3      b       3
4      b       7
5  other       2
6  other       4
7  other       1
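The same idea can be wrapped in a small helper that returns a new Series instead of modifying `df` in place, which makes it easier to try several thresholds. This is only a sketch: the function name `lump_rare` and its parameters `min_share` and `label` are hypothetical, not pandas API.

```python
import pandas as pd


def lump_rare(names: pd.Series, min_share: float, label: str = 'other') -> pd.Series:
    """Return a copy of `names` where values rarer than `min_share` become `label`."""
    # Relative frequency of each row's name, aligned to the original rows.
    share = names.map(names.value_counts(normalize=True))
    # Series.where keeps values where the condition holds and uses `label` elsewhere.
    return names.where(share.ge(min_share), label)


df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e'],
    'amount': [2, 5, 2, 3, 7, 2, 4, 1],
})

lumped = lump_rare(df['name'], min_share=0.2)

# A Series with the same index as df can be passed directly as the grouping key.
summary = df.groupby(lumped)['amount'].agg(['count', 'sum'])
print(summary)
```

Because the grouping key is passed to groupby as a separate Series, the `name` column of `df` is left untouched.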
To filter by count instead, here replacing every name that occurs fewer than 3 times:
s = df['name'].value_counts()
print(s)
b    3
a    2
d    1
e    1
c    1
Name: name, dtype: int64
df.loc[df['name'].map(s).lt(3), 'name'] = 'other'
print(df)
    name  amount
0  other       2
1  other       5
2      b       2
3      b       3
4      b       7
5  other       2
6  other       4
7  other       1
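The bucketing loop in the question (groups with a count of at most 2 are pooled by the size of their sum) could probably be expressed without a Python loop by using pd.cut. The following is a rough sketch under a few assumptions: amounts are integers, the bucket bounds are the ones from the question, and the names `edges`, `bucket`, `key` and `result` are made up for illustration. Unlike the loop, it does not emit rows for empty buckets.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e'],
    'amount': [2, 5, 2, 3, 7, 2, 4, 1],
})

per_name = df.groupby('name')['amount'].agg(['count', 'sum'])

edges = [5, 10, 25, 50]  # same upper bounds as in the question
bins = [-np.inf, *edges, np.inf]
labels = [f'{e} and smaller' for e in edges] + [f'{edges[-1] + 1} and bigger']

# pd.cut uses right-closed intervals by default, so a sum of exactly 5
# lands in '5 and smaller', matching the `s <= a` test in the loop.
bucket = pd.cut(per_name['sum'], bins=bins, labels=labels).astype(str)

# Groups with more than 2 rows keep their own name; the rest take their bucket label.
small = per_name['count'] <= 2
key = per_name.index.to_series().where(~small, bucket)

result = per_name.groupby(key).sum()
print(result)
```

With the sample data, `b` stays on its own, `c`, `d` and `e` fall into `5 and smaller`, and `a` (sum 7) falls into `10 and smaller`.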