Pandas Groupby Agg: Get strings that appear the most in column of lists

Question

I have a DataFrame like this, with IDs and Preferences, where the preferences are stored as strings separated by ', ':

ID	Preferences
1	banana, apple
1	banana, apple, kiwi
1	avocado, apple
2	avocado, grapes
2	banana, apple, kiwi

I want to group by ID and get the 2 most frequent preferences, so the result looks like this:

ID	first_preference	second_preference
1	apple	banana
2	avocado, grapes, banana, apple, kiwi

Ties ('draws') are joined together.

I need to do this inside a groupby agg, because I have other columns that also need to be aggregated.

Can anyone help me? Thanks!

Answer 1

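To get the preferences, first split the strings in your DataFrame and expand them into a longer Series. Then count the occurrences within each ID and rank them. Another groupby + agg then joins any ties for the first and second preference.

The index of this result will be the unique 'ID' values from the original DataFrame, so you can concat it with the results of your other groupby + agg operations.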
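As a small illustration of the ranking step, the sketch below applies a dense, descending rank to made-up occurrence counts. The counts and the variable names are hypothetical and not taken from the sample data. With method='dense', equal counts share a rank and the next rank follows without a gap, which is what allows tied preferences to be joined later.

```python
import pandas as pd

# Hypothetical occurrence counts of preferences for a single ID
counts = pd.Series({'apple': 3, 'banana': 2, 'kiwi': 1, 'avocado': 1})

# Highest count gets rank 1; tied counts share a rank, with no gaps
ranks = counts.rank(method='dense', ascending=False)
print(ranks)

# When every preference occurs equally often, all of them rank first
tied = pd.Series({'avocado': 1, 'grapes': 1, 'banana': 1})
tied_ranks = tied.rank(method='dense', ascending=False)
print(tied_ranks)
```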

Sample data

import pandas as pd
df = pd.DataFrame({'ID': [1,1,1,2,2],
                   'Preferences': ['banana, apple', 'banana, apple, kiwi', 'avocado, apple', 'avocado, grapes',
                                   'banana, apple, kiwi']})

Code

# Expand to long Series
s = df.set_index(['ID']).Preferences.str.split(', ', expand=True).stack()

# Within each ID, rank preferences based on # of occurrences
s = (s.groupby([s.index.get_level_values(0), s.rename('preference')]).size()
      .groupby(level=0).rank(method='dense', ascending=False)
      .map({1: 'first', 2: 'second'}).rename('order'))

res = s[s.isin(['first', 'second'])].reset_index().groupby(['ID', 'order']).agg(', '.join).unstack(-1)

# Collapse MultiIndex to get simple column labels
res.columns = [f'{y}_{x}' for x,y in res.columns]

print(res)
                        first_preference second_preference
ID                                                        
1                                  apple            banana
2   apple, avocado, banana, grapes, kiwi               NaN
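
The remark about combining this with other aggregations could look roughly like the sketch below. The 'Amount' column and the 'total_amount' label are hypothetical stand-ins for whatever other columns you need to aggregate, and the named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd

# Sample data from the question plus a hypothetical 'Amount' column
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Preferences': ['banana, apple', 'banana, apple, kiwi', 'avocado, apple',
                    'avocado, grapes', 'banana, apple, kiwi'],
    'Amount': [10, 20, 30, 40, 50],
})

# Same steps as above: expand, count, dense-rank, then join ties
s = df.set_index('ID').Preferences.str.split(', ', expand=True).stack()
s = (s.groupby([s.index.get_level_values(0), s.rename('preference')]).size()
      .groupby(level=0).rank(method='dense', ascending=False)
      .map({1: 'first', 2: 'second'}).rename('order'))
res = (s[s.isin(['first', 'second'])].reset_index()
        .groupby(['ID', 'order']).agg(', '.join).unstack(-1))
res.columns = [f'{y}_{x}' for x, y in res.columns]

# Any other per-ID aggregation is indexed by the same 'ID' values
other = df.groupby('ID').agg(total_amount=('Amount', 'sum'))

# concat aligns on the index and places the columns side by side
combined = pd.concat([res, other], axis=1)
print(combined)
```

Because both pieces are indexed by 'ID', no merge key is needed; concat with axis=1 lines the rows up by index.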
