pandas: column with highest occurring string

Question

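I have a table of words with mixed classifications. I want the 'common type' column to hold the most frequently occurring (mode) classification label for each word, so that every row of a given word carries a single label.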

word |        type | common type
post | WORK_OF_ART | WORK_OF_ART 
post | WORK_OF_ART | WORK_OF_ART 
post | WORK_OF_ART | WORK_OF_ART 
post | WORK_OF_ART | WORK_OF_ART 
post | WORK_OF_ART | WORK_OF_ART 
post |       OTHER | WORK_OF_ART 
post | WORK_OF_ART | WORK_OF_ART 
post | WORK_OF_ART | WORK_OF_ART 
post |       OTHER | WORK_OF_ART
-----|--------------------------
sign |       OTHER | OTHER
sign | WORK_OF_ART | OTHER 
sign |       OTHER | OTHER
sign | WORK_OF_ART | OTHER 
sign |       OTHER | OTHER 
sign |       OTHER | OTHER 
sign | WORK_OF_ART | OTHER

I used the following function, but the runtime is terrible on a dataframe with 1M+ rows:

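def replace_most_common_type(frame, word):
    # Boolean mask selecting every row for this word
    mask = frame['word'] == word
    # Most frequent 'type' value among those rows
    common_type = frame.loc[mask, 'type'].value_counts().idxmax()
    frame.loc[mask, 'common type'] = common_type

unique_words = master_frame['word'].unique()
# unique() returns a flat array of words, so iterate over it directly
for word in unique_words:
    replace_most_common_type(master_frame, word)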

Built-in pandas methods tend to be vectorized with NumPy, so any solution that uses native pandas functions would be appreciated.

Answer 1

Given your data:

In [1]: df
Out[1]:
    word         type
0   post  WORK_OF_ART
1   post  WORK_OF_ART
2   post  WORK_OF_ART
3   post  WORK_OF_ART
4   post  WORK_OF_ART
5   post        OTHER
6   post  WORK_OF_ART
7   post  WORK_OF_ART
8   post        OTHER
9   sign        OTHER
10  sign  WORK_OF_ART
11  sign        OTHER
12  sign  WORK_OF_ART
13  sign        OTHER
14  sign        OTHER
15  sign  WORK_OF_ART

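You can group by word and then use value_counts to find the most common type for each word, as shown in this answer. value_counts sorts by count in descending order, so the first index entry is the most frequent label. Note that you can save the "most common" series to a variable and rename it so that your column names don't clash.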

In [2]: s = df.groupby('word')['type'].agg(lambda x: x.value_counts().index[0])
   ...: s.name = 'common type'
   ...: df.merge(s, on='word')
Out[2]:
    word         type  common type
0   post  WORK_OF_ART  WORK_OF_ART
1   post  WORK_OF_ART  WORK_OF_ART
2   post  WORK_OF_ART  WORK_OF_ART
3   post  WORK_OF_ART  WORK_OF_ART
4   post  WORK_OF_ART  WORK_OF_ART
5   post        OTHER  WORK_OF_ART
6   post  WORK_OF_ART  WORK_OF_ART
7   post  WORK_OF_ART  WORK_OF_ART
8   post        OTHER  WORK_OF_ART
9   sign        OTHER        OTHER
10  sign  WORK_OF_ART        OTHER
11  sign        OTHER        OTHER
12  sign  WORK_OF_ART        OTHER
13  sign        OTHER        OTHER
14  sign        OTHER        OTHER
15  sign  WORK_OF_ART        OTHER
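
If the merge step or the per-group lambda turns out to be slow on 1M+ rows, a variation worth trying is to count each (word, type) pair once and then map the winning label back onto the rows. The following is only a sketch under assumptions: the frame is rebuilt from the sample data above, the intermediate names (counts, top, via_transform) are made up for illustration, and ties between labels are broken arbitrarily. I have not benchmarked it against your data.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "word": ["post"] * 9 + ["sign"] * 7,
    "type": [
        "WORK_OF_ART", "WORK_OF_ART", "WORK_OF_ART", "WORK_OF_ART",
        "WORK_OF_ART", "OTHER", "WORK_OF_ART", "WORK_OF_ART", "OTHER",
        "OTHER", "WORK_OF_ART", "OTHER", "WORK_OF_ART",
        "OTHER", "OTHER", "WORK_OF_ART",
    ],
})

# Count every (word, type) pair in a single grouped pass
counts = df.groupby(["word", "type"]).size().reset_index(name="n")

# Keep the highest-count row per word (ties resolved arbitrarily)
top = counts.sort_values("n", ascending=False).drop_duplicates("word")

# Look up each row's word in the word -> most common type mapping
df["common type"] = df["word"].map(top.set_index("word")["type"])

# Shorter alternative: transform broadcasts the per-group result to every row
via_transform = df.groupby("word")["type"].transform(
    lambda x: x.value_counts().index[0]
)

print(df)
```

The map version keeps the original row order and index, which merge does not always guarantee, and it avoids calling a Python function once per word. The transform version is closer to the code in In [2] and also skips the merge, but it still runs the lambda for every group.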
