How to get most frequent occurrences per unique id of multiple columns?
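I have a dataset with a unique customer_id and several order_ids, one for each purchase a customer makes. The products are reading glasses, so I removed the product information and kept only the strength of the glasses (+1, +4, +2.5, etc.).
The dataframe looks like this:
[screenshot]
I have tried several things, for example: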
testdf = testdf.groupby(['customer_id', 'order_id'])['order_item1', 'order_item2', 'order_item3', 'order_item4', 'order_item5']\
    .agg(list)\
    .apply(lambda x:list(combinations(set(x),2)))\
    .explode()
and:
def top_product(g):
    product_cols = [col for col in g.columns if col.startswith('order_item')]
    try:
        out = (g[product_cols].stack().value_counts(normalize=True)
               .reset_index().iloc[0])
        out.index = ['most_product']
        return out
    except IndexError:
        return pd.Series({'order_item': 'None', 'most_product' : 0})
output = testdf.groupby('order_id').apply(top_product)
Neither works. I want to know the most frequently purchased product for each customer. So for customer_id 11795 it would be 2.5. Any idea how to do this?
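I would first reshape the dataframe with .melt(), so that all order_item columns end up in a single value column. Then you can run .value_counts() on that column, grouped by customer. Finally, sort the dataframe by count and drop duplicate customer_ids, keeping the first, which leaves the highest count for each customer.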
import pandas as pd
import numpy as np
data = {
'order_id':[33163,38596,35326,46139,57446,65838,71228],
'customer_id':[11795,11795,10613,10613,5729,5729,5729],
'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
df = pd.DataFrame(data)
Data:
print(df)
order_id customer_id order_item1 ... order_item3 order_item4 order_item5
0 33163 11795 2.5 ... NaN NaN NaN
1 38596 11795 2.5 ... NaN NaN NaN
2 35326 10613 2.0 ... 2.0 NaN NaN
3 46139 10613 -2.5 ... NaN NaN NaN
4 57446 5729 2.5 ... NaN NaN NaN
5 65838 5729 2.5 ... NaN NaN NaN
6 71228 5729 2.5 ... NaN NaN NaN
[7 rows x 7 columns]
Melt: melt the 'order_item' columns into a single value column:
itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')
Melt output:
print(df_melt)
customer_id variable value
0 11795 order_item1 2.5
1 11795 order_item1 2.5
2 10613 order_item1 2.0
3 10613 order_item1 -2.5
4 5729 order_item1 2.5
5 5729 order_item1 2.5
6 5729 order_item1 2.5
8 11795 order_item2 2.0
9 10613 order_item2 2.5
11 5729 order_item2 2.5
12 5729 order_item2 2.5
16 10613 order_item3 2.0
Value counts of the 'value' column, per customer:
value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])
Value counts output:
print(value_counts)
customer_id value count
0 5729 2.5 5
1 10613 2.0 2
2 10613 -2.5 1
3 10613 2.5 1
4 11795 2.5 2
5 11795 2.0 1
Drop duplicate customer IDs, keeping the first:
top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')
Output:
print(top_sales)
customer_id value count
0 5729 2.5 5
1 10613 2.0 2
4 11795 2.5 2
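One thing to be aware of: keep='first' returns exactly one row per customer, so if two strengths tie for the highest count, only one of them survives, and which one depends on the row order after sorting. If ties matter for your use case, the following is a minimal sketch that keeps every tied value instead. The `counts` frame is made-up illustration data (not from the question) with a deliberate tie for customer 10613, and the names `max_per_customer` and `top_with_ties` are my own.

```python
import pandas as pd

# Hypothetical per-customer counts; customer 10613 has two strengths tied at 2
counts = pd.DataFrame({
    'customer_id': [5729, 10613, 10613, 11795, 11795],
    'value': [2.5, 2.0, -2.5, 2.5, 2.0],
    'count': [5, 2, 2, 2, 1],
})

# Highest count within each customer, broadcast back onto every row
max_per_customer = counts.groupby('customer_id')['count'].transform('max')

# Keep every row whose count equals its customer's maximum (ties included)
top_with_ties = counts[counts['count'] == max_per_customer]
print(top_with_ties)
```

Customers without a tie still get a single row; customer 10613 gets one row per tied strength.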
Full code:
import pandas as pd
import numpy as np
data = {
'order_id':[33163,38596,35326,46139,57446,65838,71228],
'customer_id':[11795,11795,10613,10613,5729,5729,5729],
'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
df = pd.DataFrame(data)
itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')
value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])
top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')
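If you need this for more than one dataframe, the same steps can be wrapped in a small function. This is only a sketch: the function name `top_strength_per_customer` and its parameters `id_col` and `prefix` are hypothetical, and it counts with groupby(...).size() rather than .value_counts(), which should give the same counts here but is a substitution on my part.

```python
import numpy as np
import pandas as pd


def top_strength_per_customer(frame, id_col='customer_id', prefix='order_item'):
    """Return one row per customer with the most frequent item value and its count."""
    # Columns holding the purchased strengths (assumed to share a common prefix)
    item_cols = [c for c in frame.columns if c.startswith(prefix)]

    # Wide to long: one row per (customer, item slot), empty slots dropped
    long = frame.melt(id_vars=id_col, value_vars=item_cols).dropna(subset=['value'])

    # Number of occurrences of each value per customer
    counts = long.groupby([id_col, 'value']).size().reset_index(name='count')

    # Highest count first within each customer, then keep that first row
    counts = counts.sort_values([id_col, 'count'], ascending=[True, False])
    return counts.drop_duplicates(subset=id_col, keep='first').reset_index(drop=True)


# Same sample data as above
df = pd.DataFrame({
    'order_id': [33163, 38596, 35326, 46139, 57446, 65838, 71228],
    'customer_id': [11795, 11795, 10613, 10613, 5729, 5729, 5729],
    'order_item1': [2.5, 2.5, 2, -2.5, 2.5, 2.5, 2.5],
    'order_item2': [np.nan, 2, 2.5, np.nan, 2.5, 2.5, np.nan],
    'order_item3': [np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan],
    'order_item4': [np.nan] * 7,
    'order_item5': [np.nan] * 7,
})

result = top_strength_per_customer(df)
print(result)
```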