How to get most frequent occurrences per unique id of multiple columns?

I have a dataset with a unique customer_id and several order_ids, one for each time a unique customer makes a purchase. It is about reading glasses, so I dropped the product information and now I only have the strength of the reading glasses (+1, +4, +2.5, etc.).

The dataframe for this looks as follows: screenshot

I have tried many things, for example:

testdf = testdf.groupby(['customer_id', 'order_id'])['order_item1', 'order_item2', 'order_item3', 'order_item4', 'order_item5']\
    .agg(list)\
    .apply(lambda x:list(combinations(set(x),2)))\
    .explode()

and:

def top_product(g):
    product_cols = [col for col in g.columns if col.startswith('order_item')]
    try:
        out = (g[product_cols].stack().value_counts(normalize=True)
                             .reset_index().iloc[0])
        out.index = ['most_product']
        return out
    except IndexError:
        return pd.Series({'order_item': 'None', 'most_product' : 0})

output = testdf.groupby('order_id').apply(top_product)

Neither works. I want to know the most purchased product per customer. So for customer_id 11795 it would be 2.5. Any idea how to do this?

I would first reshape the dataframe with .melt(). Then you can do a .value_counts() on the melted value column, grouped per customer. Finally, sort the dataframe by count and drop duplicate customer_ids, keeping the first, which leaves you with the highest count per customer.

import pandas as pd
import numpy as np


data = {
        'order_id':[33163,38596,35326,46139,57446,65838,71228],
        'customer_id':[11795,11795,10613,10613,5729,5729,5729],
        'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
        'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
        'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
        'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}

df = pd.DataFrame(data)

The data:

print(df)
   order_id  customer_id  order_item1  ...  order_item3  order_item4  order_item5
0     33163        11795          2.5  ...          NaN          NaN          NaN
1     38596        11795          2.5  ...          NaN          NaN          NaN
2     35326        10613          2.0  ...          2.0          NaN          NaN
3     46139        10613         -2.5  ...          NaN          NaN          NaN
4     57446         5729          2.5  ...          NaN          NaN          NaN
5     65838         5729          2.5  ...          NaN          NaN          NaN
6     71228         5729          2.5  ...          NaN          NaN          NaN

[7 rows x 7 columns]

Melt: melt the 'order_item' columns into a single value column:

itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')

Melt output:

print(df_melt)
    customer_id     variable  value
0         11795  order_item1    2.5
1         11795  order_item1    2.5
2         10613  order_item1    2.0
3         10613  order_item1   -2.5
4          5729  order_item1    2.5
5          5729  order_item1    2.5
6          5729  order_item1    2.5
8         11795  order_item2    2.0
9         10613  order_item2    2.5
11         5729  order_item2    2.5
12         5729  order_item2    2.5
16        10613  order_item3    2.0

Count the values in the 'value' column per customer and sort by count:

value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])

Value counts output:

print(value_counts)
   customer_id  value  count
0         5729    2.5      5
1        10613    2.0      2
2        10613   -2.5      1
3        10613    2.5      1
4        11795    2.5      2
5        11795    2.0      1

Drop duplicate customer ids, keeping the first:

top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')

Output:

print(top_sales)
   customer_id  value  count
0         5729    2.5      5
1        10613    2.0      2
4        11795    2.5      2
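
One caveat, as a side note: drop_duplicates keeps only the first row per customer, so when two values share the top count, the winner is whichever happens to sort first (for 10613 above, 2.0 beats the tied... actually 2.0 has count 2, but -2.5 and 2.5 tie at 1). If you instead want to keep every tied top value, a transform-based filter works; this sketch rebuilds the value_counts frame from the output above inline so it runs standalone:

```python
import pandas as pd

# The value_counts frame as printed above, recreated inline.
value_counts = pd.DataFrame({
    'customer_id': [5729, 10613, 10613, 10613, 11795, 11795],
    'value':       [2.5,  2.0,  -2.5,   2.5,   2.5,   2.0],
    'count':       [5,    2,     1,     1,     2,     1],
})

# Keep every row whose count equals that customer's maximum count,
# so tied top values all survive instead of only the first one.
per_customer_max = value_counts.groupby('customer_id')['count'].transform('max')
top_all = value_counts[value_counts['count'] == per_customer_max]
print(top_all)
```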

Full code:

import pandas as pd
import numpy as np


data = {
        'order_id':[33163,38596,35326,46139,57446,65838,71228],
        'customer_id':[11795,11795,10613,10613,5729,5729,5729],
        'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
        'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
        'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
        'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}

df = pd.DataFrame(data)

itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')

value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])

top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')
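
For completeness, a more compact alternative (a sketch, not part of the walkthrough above): stack the item columns into one long Series keyed by customer_id and take each customer's mode. Note that Series.mode returns all tied values sorted ascending, so .iloc[0] picks the smallest value on a tie.

```python
import pandas as pd
import numpy as np

data = {
    'order_id': [33163, 38596, 35326, 46139, 57446, 65838, 71228],
    'customer_id': [11795, 11795, 10613, 10613, 5729, 5729, 5729],
    'order_item1': [2.5, 2.5, 2, -2.5, 2.5, 2.5, 2.5],
    'order_item2': [np.nan, 2, 2.5, np.nan, 2.5, 2.5, np.nan],
    'order_item3': [np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan],
    'order_item4': [np.nan] * 7,
    'order_item5': [np.nan] * 7,
}
df = pd.DataFrame(data)

# Select only the item columns, indexed by customer_id, then stack them
# into one long Series. NaN is dropped explicitly, since older pandas
# drops it inside stack() while newer versions may keep it.
items = df.set_index('customer_id').filter(like='order_item')
top = (items.stack()
            .dropna()
            .groupby(level='customer_id')
            .agg(lambda s: s.mode().iloc[0]))
print(top)
```

This gives one value per customer (2.5 for 5729, 2.0 for 10613, 2.5 for 11795), at the cost of the count column that the drop_duplicates approach keeps.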