Question

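I have the following DataFrame. How can I create a new column that keeps the cities accounting for 80% of all values? In this case they are 'a', 'b' and 'c'. The remaining cities should be labeled 'other'.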

import pandas as pd

values = ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','d','d','d','e','e','f']
db = pd.DataFrame(values, columns=['city'])

db['city'].value_counts(normalize=True)

a    0.32
b    0.24
c    0.20
d    0.12
e    0.08
f    0.04

Expected output

db['city_freq'] = ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','other','other','other','other','other','other']

Answer 1

Use Series.cumsum on the normalized value counts, take the index values whose cumulative share exceeds the threshold, then match them against the original column with Series.isin and replace those rows via DataFrame.loc:

s = db['city'].value_counts(normalize=True).cumsum()

print (s)
a    0.32
b    0.56
c    0.76
d    0.88
e    0.96
f    1.00

print (s.index[s > 0.8])
Index(['d', 'e', 'f'], dtype='object')

db.loc[db['city'].isin(s.index[s > 0.8]), 'city'] = 'other'
print (db)
     city
0       a
1       a
2       a
3       a
4       a
5       a
6       a
7       a
8       b
9       b
10      b
11      b
12      b
13      b
14      c
15      c
16      c
17      c
18      c
19  other
20  other
21  other
22  other
23  other
24  other
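
Note that the code above overwrites `city`, whereas the question asks for a separate column. A minimal sketch of one way to write the labels to a new `city_freq` column instead, assuming the same data and the same 0.8 threshold (the variable name `rare` is only illustrative):

```python
import pandas as pd

values = ['a'] * 8 + ['b'] * 6 + ['c'] * 5 + ['d'] * 3 + ['e'] * 2 + ['f']
db = pd.DataFrame(values, columns=['city'])

# Cumulative share of each city, most frequent first
s = db['city'].value_counts(normalize=True).cumsum()

# Cities whose cumulative share is past the threshold
rare = s.index[s > 0.8]

# Keep the original value where the city is not rare, else use 'other';
# the original 'city' column is left untouched
db['city_freq'] = db['city'].where(~db['city'].isin(rare), 'other')
```

Series.where keeps values where the condition is True and substitutes the second argument elsewhere, so no in-place assignment on `city` is needed.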

Another solution uses Series.map to map the cumulative sums onto each row and then compares them with the threshold:

s = db['city'].value_counts(normalize=True).cumsum()

db.loc[db['city'].map(s) > 0.8, 'city'] = 'other'

Details:

print (db['city'].map(s))
0     0.32
1     0.32
2     0.32
3     0.32
4     0.32
5     0.32
6     0.32
7     0.32
8     0.56
9     0.56
10    0.56
11    0.56
12    0.56
13    0.56
14    0.76
15    0.76
16    0.76
17    0.76
18    0.76
19    0.88
20    0.88
21    0.88
22    0.96
23    0.96
24    1.00
Name: city, dtype: float64
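
The map-based approach can also be wrapped in a small reusable function. This is only a sketch: the helper name `label_top_share` and its parameters `threshold` and `other` are hypothetical, not part of pandas.

```python
import pandas as pd

def label_top_share(col, threshold=0.8, other='other'):
    """Return a copy of col where categories past the cumulative-share
    threshold are replaced by `other` (hypothetical helper)."""
    # Cumulative share per category, most frequent first
    cum_share = col.value_counts(normalize=True).cumsum()
    # Map each row to its category's cumulative share, then replace
    # rows where that share exceeds the threshold
    return col.mask(col.map(cum_share) > threshold, other)

values = ['a'] * 8 + ['b'] * 6 + ['c'] * 5 + ['d'] * 3 + ['e'] * 2 + ['f']
db = pd.DataFrame(values, columns=['city'])
db['city_freq'] = label_top_share(db['city'])
```

With this rule, a category whose cumulative share crosses the threshold is itself labeled `other`: 'd' brings the total from 0.76 to 0.88, so it is replaced, which matches the expected output in the question.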

Python - Pandas Percentile distribution less than 80%

percentile

pandas