pandas: column with highest occurring string
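I have a table of words with mixed classifications. I want the 'common type' column to hold the most frequently occurring (mode) classification label for each word, so that every row has one label.
word | type | common type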
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | OTHER | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | OTHER | WORK_OF_ART
-----|--------------------------
sign | OTHER | OTHER
sign | WORK_OF_ART | OTHER
sign | OTHER | OTHER
sign | WORK_OF_ART | OTHER
sign | OTHER | OTHER
sign | OTHER | OTHER
sign | WORK_OF_ART | OTHER
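I used the following function, but the runtime is terrible on a dataframe with 1m+ rows:
def replace_most_common_type(frame, word):
    # Most frequent 'type' among the rows for this word
    common_type = frame[frame['word']==word]['type'].value_counts().idxmax()
    # Overwrite 'type' for every row of this word
    frame.loc[frame['word']==word, 'type'] = common_type

unique_words = master_frame['word'].unique()
# unique() returns a flat array of words, so iterate over it directly
for word in unique_words:
    replace_most_common_type(master_frame, word)
Built-in pandas methods tend to be NumPy-vectorized, so any solution that uses native pandas functions would be appreciated.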
Given your data:
In [1]: df
Out[1]:
word type
0 post WORK_OF_ART
1 post WORK_OF_ART
2 post WORK_OF_ART
3 post WORK_OF_ART
4 post WORK_OF_ART
5 post OTHER
6 post WORK_OF_ART
7 post WORK_OF_ART
8 post OTHER
9 sign OTHER
10 sign WORK_OF_ART
11 sign OTHER
12 sign WORK_OF_ART
13 sign OTHER
14 sign OTHER
15 sign WORK_OF_ART
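You can group by word and then use value_counts to find the most common type for each word, as shown in this answer. Note that you can save the "most common" Series to a variable and rename it so that your column names do not clash.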
In [2]: s = df.groupby('word')['type'].agg(lambda x: x.value_counts().index[0])
...: s.name = 'common type'
...: df.merge(s, on='word')
Out[2]:
word type common type
0 post WORK_OF_ART WORK_OF_ART
1 post WORK_OF_ART WORK_OF_ART
2 post WORK_OF_ART WORK_OF_ART
3 post WORK_OF_ART WORK_OF_ART
4 post WORK_OF_ART WORK_OF_ART
5 post OTHER WORK_OF_ART
6 post WORK_OF_ART WORK_OF_ART
7 post WORK_OF_ART WORK_OF_ART
8 post OTHER WORK_OF_ART
9 sign OTHER OTHER
10 sign WORK_OF_ART OTHER
11 sign OTHER OTHER
12 sign WORK_OF_ART OTHER
13 sign OTHER OTHER
14 sign OTHER OTHER
15 sign WORK_OF_ART OTHER
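As an alternative sketch (not part of the answer above), the same column can probably be built without a merge by using `groupby(...).transform`, which broadcasts one value per group back onto the original rows and keeps the original index. The sample frame below is reconstructed from the data shown in the question; how ties between equally frequent types are broken is not specified here, so treat that case as an assumption to check on your own data.

```python
import pandas as pd

# Sample data reconstructed from the question: 9 'post' rows, 7 'sign' rows
df = pd.DataFrame({
    "word": ["post"] * 9 + ["sign"] * 7,
    "type": (
        ["WORK_OF_ART"] * 5 + ["OTHER"] + ["WORK_OF_ART"] * 2 + ["OTHER"]
        + ["OTHER", "WORK_OF_ART", "OTHER", "WORK_OF_ART",
           "OTHER", "OTHER", "WORK_OF_ART"]
    ),
})

# For each word, find the most frequent type and broadcast it to every row
# of that word; transform returns a Series aligned with df's index.
df["common type"] = (
    df.groupby("word")["type"]
      .transform(lambda x: x.value_counts().idxmax())
)

print(df)
```

The lambda still runs once per distinct word, so this is a convenience over the merge rather than a guaranteed speedup; timing both on the real 1m+ row frame would be the way to decide.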