Obtaining the number of occurrences of a value in a column in Python using pandas
I am learning to analyze data with Python and have run into a question.
Consider the following dataset:
print (df)
CITY OCCUPATION
0 BANGALORE MECHANICAL ENGINEER
1 BANGALORE COMPUTER SCIENCE ENGINEER
2 BANGALORE MECHANICAL ENGINEER
3 BANGALORE COMPUTER SCIENCE ENGINEER
4 BANGALORE COMPUTER SCIENCE ENGINEER
5 MUMBAI ACTOR
6 MUMBAI ACTOR
7 MUMBAI SHARE BROKER
8 MUMBAI SHARE BROKER
9 MUMBAI ACTOR
10 CHENNAI RETIRED
11 CHENNAI LAND DEVELOPER
12 CHENNAI MECHANICAL ENGINEER
13 CHENNAI MECHANICAL ENGINEER
14 CHENNAI MECHANICAL ENGINEER
15 DELHI PHYSICIAN
16 DELHI PHYSICIAN
17 DELHI JOURNALIST
18 DELHI JOURNALIST
19 DELHI ACTOR
20 PUNE MANAGER
21 PUNE MANAGER
22 PUNE MANAGER
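To make the examples reproducible, here is a minimal sketch that rebuilds the printed dataset as a DataFrame. The helper names `cities` and `occupations` are only illustrative; the question does not show how `df` was created.

```python
import pandas as pd

# City labels in the same order as the printed rows (5 + 5 + 5 + 5 + 3 = 23 rows)
cities = (["BANGALORE"] * 5 + ["MUMBAI"] * 5 + ["CHENNAI"] * 5
          + ["DELHI"] * 5 + ["PUNE"] * 3)

# Occupations copied row by row from the printed dataset
occupations = [
    "MECHANICAL ENGINEER", "COMPUTER SCIENCE ENGINEER", "MECHANICAL ENGINEER",
    "COMPUTER SCIENCE ENGINEER", "COMPUTER SCIENCE ENGINEER",
    "ACTOR", "ACTOR", "SHARE BROKER", "SHARE BROKER", "ACTOR",
    "RETIRED", "LAND DEVELOPER", "MECHANICAL ENGINEER",
    "MECHANICAL ENGINEER", "MECHANICAL ENGINEER",
    "PHYSICIAN", "PHYSICIAN", "JOURNALIST", "JOURNALIST", "ACTOR",
    "MANAGER", "MANAGER", "MANAGER",
]

df = pd.DataFrame({"CITY": cities, "OCCUPATION": occupations})
print(df)
```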
How can I use pandas to get the most frequent occupation for each city?
For example:
STATE OCCUPATION
----------------
BANGALORE - COMPUTER SCIENCE ENGINEER
-----------------------------------
MUMBAI - ACTOR
------------
The first solution uses groupby with Counter and most_common.
Note that in DELHI, JOURNALIST and PHYSICIAN have the same count (2), so the solutions below return different results for that city.
from collections import Counter

# For each city, count the occupations and keep the most common one
df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: Counter(x).most_common(1)[0][0])
         .reset_index())
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI PHYSICIAN
3 MUMBAI ACTOR
4 PUNE MANAGER
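If the DELHI tie matters, one possible variation is to return every occupation that shares the top count instead of a single winner. The sketch below uses `Series.mode`, which is not part of the original answer, and only the DELHI and PUNE rows as a stand-in for the full dataset.

```python
import pandas as pd

# Stand-in data: only the DELHI and PUNE rows of the question's dataset
df = pd.DataFrame({
    "CITY": ["DELHI"] * 5 + ["PUNE"] * 3,
    "OCCUPATION": ["PHYSICIAN", "PHYSICIAN", "JOURNALIST", "JOURNALIST",
                   "ACTOR", "MANAGER", "MANAGER", "MANAGER"],
})

# Series.mode returns all values tied for the highest count, in sorted order
top = df.groupby("CITY")["OCCUPATION"].apply(lambda s: s.mode().tolist())
print(top)
```

Here DELHI yields both JOURNALIST and PHYSICIAN, while PUNE yields only MANAGER.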
Another solution uses groupby, size and nlargest:
# Count each (CITY, OCCUPATION) pair, then keep the largest count per city
df1 = (df.groupby(['CITY', 'OCCUPATION'])
         .size()
         .groupby(level=0)
         .nlargest(1)
         .reset_index(level=0, drop=True)
         .reset_index(name='a')
         .drop('a', axis=1))
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI JOURNALIST
3 MUMBAI ACTOR
4 PUNE MANAGER
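One way to see why this solution reports JOURNALIST for DELHI is to look at the intermediate counts. The sketch below uses only the DELHI rows as a stand-in for the full dataset: groupby sorts the group keys, so JOURNALIST comes before PHYSICIAN, and nlargest keeps the first of the tied values by default.

```python
import pandas as pd

# Stand-in data: only the DELHI rows of the question's dataset
df = pd.DataFrame({
    "CITY": ["DELHI"] * 5,
    "OCCUPATION": ["PHYSICIAN", "PHYSICIAN", "JOURNALIST", "JOURNALIST", "ACTOR"],
})

# size() counts rows per (CITY, OCCUPATION); the index is sorted by the group keys
counts = df.groupby(["CITY", "OCCUPATION"]).size()
print(counts)

# nlargest(1) uses keep='first' by default, so the first tied entry is returned
winner = counts.nlargest(1).index[0]
print(winner)
```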
EDIT:
For debugging, it is best to use a custom function that does the same thing as the lambda function:
from collections import Counter

def f(x):
    # print the Series for the current group
    print (x)
    # count values with Counter
    print (Counter(x).most_common())
    # get the top value - a list containing one tuple
    print (Counter(x).most_common(1))
    # select from the list by indexing [0] - output is a tuple
    print (Counter(x).most_common(1)[0])
    # select the first value of the tuple with another [0]
    # to select the count, use [1] instead of [0]
    print (Counter(x).most_common(1)[0][0])
    return Counter(x).most_common(1)[0][0]

df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
print (df1)
CITY OCCUPATION
0 BANGALORE COMPUTER SCIENCE ENGINEER
1 CHENNAI MECHANICAL ENGINEER
2 DELHI PHYSICIAN
3 MUMBAI ACTOR
4 PUNE MANAGER
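As the comments in f point out, index [1] of the tuple holds the count. A minimal sketch that keeps both the occupation and its count per city could look like the following; the `COUNT` column and the `records` list are illustrative names, and only the MUMBAI and CHENNAI rows are used as stand-in data.

```python
from collections import Counter

import pandas as pd

# Stand-in data: only the MUMBAI and CHENNAI rows of the question's dataset
df = pd.DataFrame({
    "CITY": ["MUMBAI"] * 5 + ["CHENNAI"] * 5,
    "OCCUPATION": ["ACTOR", "ACTOR", "SHARE BROKER", "SHARE BROKER", "ACTOR",
                   "RETIRED", "LAND DEVELOPER", "MECHANICAL ENGINEER",
                   "MECHANICAL ENGINEER", "MECHANICAL ENGINEER"],
})

records = []
# Iterating over a grouped Series yields (group key, Series) pairs
for city, occupations in df.groupby("CITY")["OCCUPATION"]:
    # most_common(1)[0] is an (occupation, count) tuple
    occupation, count = Counter(occupations).most_common(1)[0]
    records.append({"CITY": city, "OCCUPATION": occupation, "COUNT": count})

result = pd.DataFrame(records)
print(result)
```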