Python (merge) create brand choice/purchase dataset

Question

I am trying to build a choice-model dataset from several CSV files (think of whether an individual buys a given brand's product at a given price).

A small representation of my data:

import pandas as pd

d1 = {'Product': [1,1,2,2,3], 'Price': [25, 25, 22, 22,35], 'Buyer ID': ['A','B','C','D','E']}
df1 = pd.DataFrame(d1)

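Here df1 contains information about the product each buyer considered purchasing. Note that all three products (1, 2 and 3) are available to every consumer when making a purchase decision.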

d2 = {'Buyer Num': ['A','B','E'], 'Product': [1,1,3,], 'Purchase Decision': ['Yes','Yes','Yes']}
df2 = pd.DataFrame(d2)

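df2 contains information about which product each consumer ultimately purchased. Consumers A, B and E purchased products 1, 1 and 3, respectively.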

I tried merging the two datasets with both outer and inner joins. For example:

df3 = df1.merge(df2,left_on='Buyer ID', right_on='Buyer Num', how='outer')

What I get from the outer join is:

   Buyer ID  Price  Product_x Buyer Num  Product_y Purchase Decision
     A        25          1         A        1.0               Yes
     B        25          1         B        1.0               Yes
     C        22          2       NaN        NaN               NaN
     D        22          2       NaN        NaN               NaN
     E        35          3         E        3.0               Yes

However, what I actually want is one row per buyer and product, like this:

Buyer ID    Price   Product  Purchase Decision
A             25        1      Yes
B             25        1      Yes
C             25        1      No
D             25        1      No
E             25        1      No
A             22        2      No
B             22        2      No
C             22        2      No
D             22        2      No
E             22        2      No
A             35        3      No
B             35        3      No
C             35        3      No
D             35        3      No
E             35        3      Yes

Can someone tell me how to do this in Python?

Answer 1

You can try:

import pandas as pd

# Outer merge on buyer and product, then drop the redundant buyer column
df = pd.merge(df1, df2, left_on=['Buyer ID', 'Product'], right_on=['Buyer Num', 'Product'],
              how='outer').drop('Buyer Num', axis=1)

# Cartesian product of the unique 'Buyer ID' and 'Price' values, with named levels
midx = pd.MultiIndex.from_product([df1['Buyer ID'].unique(), df1['Price'].unique()],
                                  names=['Buyer ID', 'Price'])
# Index by (buyer, price) and reindex onto the full product; missing pairs become NaN
d = df.set_index(['Buyer ID', 'Price']).reindex(midx)
# Fill NaNs in 'Product' with the known product for each price
# (this relies on every price belonging to exactly one product)
d['Product'] = d['Product'].fillna(d.groupby(level='Price')['Product'].transform('first')).astype(int)
# The remaining NaNs are pairs with no recorded purchase
d['Purchase Decision'] = d['Purchase Decision'].fillna('No')
result = d.sort_values(['Product', 'Buyer ID']).reset_index()
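
Another way to get the same table, as a rough sketch rather than part of the answer above, is to build the buyer-by-product grid with a cross join and then left-merge the purchases onto it. This assumes pandas 1.2 or later (for how='cross') and one price per product. The names products, buyers, grid and choice are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Product': [1, 1, 2, 2, 3],
                    'Price': [25, 25, 22, 22, 35],
                    'Buyer ID': ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'Buyer Num': ['A', 'B', 'E'],
                    'Product': [1, 1, 3],
                    'Purchase Decision': ['Yes', 'Yes', 'Yes']})

# One row per product with its price (assumes a single price per product)
products = df1[['Product', 'Price']].drop_duplicates()
# One row per buyer
buyers = df1[['Buyer ID']].drop_duplicates()

# Pair every buyer with every product: 5 buyers x 3 products = 15 rows
grid = buyers.merge(products, how='cross')

# Attach recorded purchases; pairs without a record get NaN
choice = grid.merge(df2.rename(columns={'Buyer Num': 'Buyer ID'}),
                    on=['Buyer ID', 'Product'], how='left')
# Anything not recorded as purchased counts as 'No'
choice['Purchase Decision'] = choice['Purchase Decision'].fillna('No')

# Order rows and columns like the desired output in the question
choice = (choice.sort_values(['Product', 'Buyer ID'])
                .reset_index(drop=True)
                [['Buyer ID', 'Price', 'Product', 'Purchase Decision']])
print(choice)
```

Because the grid is built from the products rather than the prices, this version should also work when two products share a price, which the reindex approach cannot tell apart.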

python

merge

pandas

data-cleaning