Python - Pandas: percentile distribution less than 80%
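I have the following DataFrame. How can I create a new column that contains only the cities that together account for 80% of all values? In this case they are 'a', 'b' and 'c'. The remaining cities should be labeled 'other'.
import pandas as pd
values = ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','d','d','d','e','e','f']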
db = pd.DataFrame(values,columns = ['city'])
db['city'].value_counts(normalize=True)
a 0.32
b 0.24
c 0.20
d 0.12
e 0.08
f 0.04
Expected output:
db['city_freq'] = ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c','c','other','other','other','other','other','other']
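Use Series.cumsum to get the cumulative sum of the normalized counts, filter it with the condition, take the index values of the matches, then compare the original column against them with Series.isin and replace the matching rows with DataFrame.loc: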
s = db['city'].value_counts(normalize=True).cumsum()
print (s)
a 0.32
b 0.56
c 0.76
d 0.88
e 0.96
f 1.00
print (s.index[s > 0.8])
Index(['d', 'e', 'f'], dtype='object')
db.loc[db['city'].isin(s.index[s > 0.8]), 'city'] = 'other'
print (db)
city
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 c
15 c
16 c
17 c
18 c
19 other
20 other
21 other
22 other
23 other
24 other
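Note that the snippet above overwrites city in place, whereas the question asks for a new column. A minimal sketch of the same cumsum/isin idea that writes to city_freq instead, assuming Series.where is acceptable for the replacement (the name keep is only illustrative and not part of the original answer):

```python
import pandas as pd

values = ["a"] * 8 + ["b"] * 6 + ["c"] * 5 + ["d"] * 3 + ["e"] * 2 + ["f"]
db = pd.DataFrame(values, columns=["city"])

# Cumulative share of rows, most frequent city first.
s = db["city"].value_counts(normalize=True).cumsum()

# Cities whose cumulative share stays within the 80% threshold
# ('keep' is a hypothetical helper name).
keep = s.index[s <= 0.8]

# Keep those cities and label every other row 'other';
# the original 'city' column is left unchanged.
db["city_freq"] = db["city"].where(db["city"].isin(keep), "other")

print(db["city_freq"].value_counts())
```

With this data, a city is kept only if the cumulative share including it does not exceed 0.8, so 'd' (0.88) is already labeled 'other' even though 'a' through 'c' cover only 76% of the rows.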
Another solution uses Series.map to map each city to its cumulative sum and then compares the result against the threshold:
s = db['city'].value_counts(normalize=True).cumsum()
db.loc[db['city'].map(s) > 0.8, 'city'] = 'other'
Details:
print (db['city'].map(s))
0 0.32
1 0.32
2 0.32
3 0.32
4 0.32
5 0.32
6 0.32
7 0.32
8 0.56
9 0.56
10 0.56
11 0.56
12 0.56
13 0.56
14 0.76
15 0.76
16 0.76
17 0.76
18 0.76
19 0.88
20 0.88
21 0.88
22 0.96
23 0.96
24 1.00
Name: city, dtype: float64
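The map-based variant can be wrapped in a small reusable helper. This is only a sketch: the function name label_rare and its threshold and other parameters are hypothetical, and Series.mask is used here in place of the DataFrame.loc assignment so that the input column is not modified:

```python
import pandas as pd


def label_rare(col: pd.Series, threshold: float = 0.8, other: str = "other") -> pd.Series:
    """Return a copy of `col` where values past the cumulative threshold become `other`."""
    # Cumulative share of rows, most frequent value first.
    cum_share = col.value_counts(normalize=True).cumsum()
    # Map each row to its value's cumulative share, then replace rows above the threshold.
    return col.mask(col.map(cum_share) > threshold, other)


db = pd.DataFrame({"city": list("a" * 8 + "b" * 6 + "c" * 5 + "d" * 3 + "e" * 2 + "f")})
db["city_freq"] = label_rare(db["city"])
print(db["city_freq"].unique())
```

Passing a different threshold, for example label_rare(db["city"], threshold=0.5), keeps only 'a' for this data, because 'b' already brings the cumulative share to 0.56.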