Obtaining number of occurrences of a variable from a column in Python using pandas

Question

I am still learning to analyze data with Python and ran into a question. Consider the following dataset:

print (df)
         CITY                 OCCUPATION
0   BANGALORE        MECHANICAL ENGINEER
1   BANGALORE  COMPUTER SCIENCE ENGINEER
2   BANGALORE        MECHANICAL ENGINEER
3   BANGALORE  COMPUTER SCIENCE ENGINEER
4   BANGALORE  COMPUTER SCIENCE ENGINEER
5      MUMBAI                      ACTOR
6      MUMBAI                      ACTOR
7      MUMBAI               SHARE BROKER
8      MUMBAI               SHARE BROKER
9      MUMBAI                      ACTOR
10    CHENNAI                    RETIRED
11    CHENNAI             LAND DEVELOPER
12    CHENNAI        MECHANICAL ENGINEER
13    CHENNAI        MECHANICAL ENGINEER
14    CHENNAI        MECHANICAL ENGINEER
15      DELHI                  PHYSICIAN
16      DELHI                  PHYSICIAN
17      DELHI                 JOURNALIST
18      DELHI                 JOURNALIST
19      DELHI                      ACTOR
20       PUNE                    MANAGER
21       PUNE                    MANAGER
22       PUNE                    MANAGER

How can I use pandas to get the occupation that occurs most often in each state (the CITY column above)? For example:

STATE      OCCUPATION
BANGALORE  COMPUTER SCIENCE ENGINEER
MUMBAI     ACTOR

Answer 1

The first solution uses groupby with Counter and most_common.

Because DELHI has a tie (JOURNALIST and PHYSICIAN each occur 2 times), the solutions below can return different occupations for it.

from collections import Counter

# most frequent occupation per city; the parentheses let the chained calls span several lines
df1 = (df.groupby('CITY').OCCUPATION
         .apply(lambda x: Counter(x).most_common(1)[0][0])
         .reset_index())
print (df1)
        CITY                 OCCUPATION
0  BANGALORE  COMPUTER SCIENCE ENGINEER
1    CHENNAI        MECHANICAL ENGINEER
2      DELHI                  PHYSICIAN
3     MUMBAI                      ACTOR
4       PUNE                    MANAGER
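
To try this without the original file, here is a minimal self-contained sketch. The way `df` is built is an assumption reconstructed from the printed data in the question (rows are grouped, not in the original row order), and `delhi_counts` is just an illustrative name. It runs the same groupby and then prints the full counts for DELHI so the tie is visible.

```python
from collections import Counter

import pandas as pd

# Assumed reconstruction of the question's data (grouped, not in the original row order)
rows = (
    [("BANGALORE", "MECHANICAL ENGINEER")] * 2
    + [("BANGALORE", "COMPUTER SCIENCE ENGINEER")] * 3
    + [("MUMBAI", "ACTOR")] * 3
    + [("MUMBAI", "SHARE BROKER")] * 2
    + [("CHENNAI", "RETIRED"), ("CHENNAI", "LAND DEVELOPER")]
    + [("CHENNAI", "MECHANICAL ENGINEER")] * 3
    + [("DELHI", "PHYSICIAN")] * 2
    + [("DELHI", "JOURNALIST")] * 2
    + [("DELHI", "ACTOR")]
    + [("PUNE", "MANAGER")] * 3
)
df = pd.DataFrame(rows, columns=["CITY", "OCCUPATION"])

# Most frequent occupation per city, as in the solution above
df1 = (df.groupby("CITY")["OCCUPATION"]
         .apply(lambda x: Counter(x).most_common(1)[0][0])
         .reset_index())
print(df1)

# Full counts for DELHI, to make the tie between two occupations visible
delhi_counts = Counter(df.loc[df["CITY"] == "DELHI", "OCCUPATION"])
print(delhi_counts.most_common())
```

Which of the two tied occupations is reported for DELHI depends on how Counter orders equal counts, so it is safer not to rely on that row.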

Another solution with groupby, size and nlargest:

# count each (CITY, OCCUPATION) pair, then keep the largest count per city
df1 = (df.groupby(['CITY', 'OCCUPATION'])
         .size()
         .groupby(level=0)
         .nlargest(1)
         .reset_index(level=0, drop=True)
         .reset_index(name='a')
         .drop('a', axis=1))
print (df1)
        CITY                 OCCUPATION
0  BANGALORE  COMPUTER SCIENCE ENGINEER
1    CHENNAI        MECHANICAL ENGINEER
2      DELHI                 JOURNALIST
3     MUMBAI                      ACTOR
4       PUNE                    MANAGER
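
Both solutions keep only one occupation per city, so one of the tied DELHI occupations is silently dropped. If every tied occupation should be kept, one possible variant is to compare each count with the maximum count of its city. This is a sketch, not part of the solutions above: the `COUNT` column name and the `counts` and `top` variables are illustrative, and `df` is again an assumed reconstruction of the question's data.

```python
import pandas as pd

# Assumed reconstruction of the question's data (grouped, not in the original row order)
rows = (
    [("BANGALORE", "MECHANICAL ENGINEER")] * 2
    + [("BANGALORE", "COMPUTER SCIENCE ENGINEER")] * 3
    + [("MUMBAI", "ACTOR")] * 3
    + [("MUMBAI", "SHARE BROKER")] * 2
    + [("CHENNAI", "RETIRED"), ("CHENNAI", "LAND DEVELOPER")]
    + [("CHENNAI", "MECHANICAL ENGINEER")] * 3
    + [("DELHI", "PHYSICIAN")] * 2
    + [("DELHI", "JOURNALIST")] * 2
    + [("DELHI", "ACTOR")]
    + [("PUNE", "MANAGER")] * 3
)
df = pd.DataFrame(rows, columns=["CITY", "OCCUPATION"])

# Number of rows for each (CITY, OCCUPATION) pair
counts = (df.groupby(["CITY", "OCCUPATION"])
            .size()
            .reset_index(name="COUNT"))

# Keep every row whose count equals the maximum count within its city
is_top = counts["COUNT"] == counts.groupby("CITY")["COUNT"].transform("max")
top = counts[is_top].reset_index(drop=True)
print(top)
```

With this data DELHI appears twice in `top`, once for each tied occupation, and the `COUNT` column gives the number of occurrences.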

Edit:

For debugging, it is best to use a custom function. It does the same thing as the lambda function:

from collections import Counter

def f(x):
    # print the Series of occupations for this group
    print (x)
    # count values with Counter, sorted from most to least common
    print (Counter(x).most_common())
    # get the top value only - a list with one (value, count) tuple
    print (Counter(x).most_common(1))
    # select the list item by indexing [0] - output is a tuple
    print (Counter(x).most_common(1)[0])
    # select the first value of the tuple with another [0]
    # to select the count, use [1] instead of [0]
    print (Counter(x).most_common(1)[0][0])
    return Counter(x).most_common(1)[0][0]

df1 = df.groupby('CITY').OCCUPATION.apply(f).reset_index()
print (df1)
        CITY                 OCCUPATION
0  BANGALORE  COMPUTER SCIENCE ENGINEER
1    CHENNAI        MECHANICAL ENGINEER
2      DELHI                 JOURNALIST
3     MUMBAI                      ACTOR
4       PUNE                    MANAGER
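
As the comments in `f` mention, `most_common(1)[0]` is a `(value, count)` tuple, so the count can be kept as well. The following is a minimal sketch under the same assumed reconstruction of `df`. The plain loop over groups, the `COUNT` column and the `records` and `result` names are illustrative choices, not part of the solution above.

```python
from collections import Counter

import pandas as pd

# Assumed reconstruction of the question's data (grouped, not in the original row order)
rows = (
    [("BANGALORE", "MECHANICAL ENGINEER")] * 2
    + [("BANGALORE", "COMPUTER SCIENCE ENGINEER")] * 3
    + [("MUMBAI", "ACTOR")] * 3
    + [("MUMBAI", "SHARE BROKER")] * 2
    + [("CHENNAI", "RETIRED"), ("CHENNAI", "LAND DEVELOPER")]
    + [("CHENNAI", "MECHANICAL ENGINEER")] * 3
    + [("DELHI", "PHYSICIAN")] * 2
    + [("DELHI", "JOURNALIST")] * 2
    + [("DELHI", "ACTOR")]
    + [("PUNE", "MANAGER")] * 3
)
df = pd.DataFrame(rows, columns=["CITY", "OCCUPATION"])

records = []
# Iterating a groupby yields (group key, group values) pairs
for city, occupations in df.groupby("CITY")["OCCUPATION"]:
    # Unpack the single (value, count) tuple returned by most_common(1)
    occupation, count = Counter(occupations).most_common(1)[0]
    records.append({"CITY": city, "OCCUPATION": occupation, "COUNT": count})

result = pd.DataFrame(records)
print(result)
```

A loop is slower than a vectorized groupby on large data, but it keeps each step easy to inspect, which fits the debugging purpose here.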

python

numpy

data-analysis

pandas