Stack two pairs of columns into two measure columns
I have data like this:
order_id Product_A Product_B Price_Product_A Price_Product_B
100 Pen Notebook 1.5 3
101 Bag Watch 10 12
I need it to look like this:
order_id product price
100 Pen 1.5
100 Notebook 3
101 Bag 10
101 Watch 12
How can I use stack() and unstack() for this? So far I have only used them with a single measure.
I would simply create two data frames: one for product A and one for product B. Then give both the same column names and concatenate them like this:
import pandas as pd
df1 = df[['order_id', 'Product_A', 'Price_Product_A']]
df2 = df[['order_id', 'Product_B', 'Price_Product_B']]
df1.columns = ['order_id', 'product', 'price']
df2.columns = ['order_id', 'product', 'price']
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df1, df2])
df
Output:
order_id product price
0 100 Pen 1.5
1 101 Bag 10.0
0 100 Notebook 3.0
1 101 Watch 12.0
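The same idea can be sketched for any list of product suffixes. This is only an illustration built on the sample data from the question; the `suffixes` list and the names `parts` and `result` are assumptions, not part of the original answer. A stable sort on `order_id` puts the rows in the order the question asks for.

```python
import pandas as pd

# sample data copied from the question
df = pd.DataFrame({
    "order_id": [100, 101],
    "Product_A": ["Pen", "Bag"],
    "Product_B": ["Notebook", "Watch"],
    "Price_Product_A": [1.5, 10],
    "Price_Product_B": [3, 12],
})

suffixes = ["A", "B"]  # assumed: one suffix per product column pair

parts = []
for suffix in suffixes:
    part = df[["order_id", f"Product_{suffix}", f"Price_Product_{suffix}"]]
    # set_axis returns a new frame with the common column names
    parts.append(part.set_axis(["order_id", "product", "price"], axis=1))

result = (
    pd.concat(parts, ignore_index=True)
    .sort_values("order_id", kind="mergesort")  # stable: keeps A before B within an order
    .reset_index(drop=True)
)
print(result)
```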
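Perhaps the best way to represent this data is with a multi-indexed DataFrame.
Here is a clunky but working way to build one for any number of orders and products: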
# array holding the list of products for each order
prod_array = df[[column for column in df.columns if column[:-1] == 'Product_']].values
# array holding the list of prices for each order
price_array = df[[column for column in df.columns if column[:-1] == 'Price_Product_']].values
# order ids
order_id_array = df['order_id']
# create empty dataframe
df_mi = pd.DataFrame(columns=["order_id","order_item_id","Product","Price_Product"])
# add one row per (order, item) pair
for i in range(len(order_id_array)):
    for j in range(len(prod_array[i])):
        df_mi.loc[df_mi.shape[0]] = [order_id_array[i], j, prod_array[i][j], price_array[i][j]]
# create multiindex dataframe
df_mi = df_mi.sort_values(['order_id','order_item_id']).set_index(['order_id','order_item_id'])
This results in the following DataFrame: [image: multi-index table]
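To show what the multi-index buys you, here is a small sketch of lookups on such a frame. The frame is rebuilt directly from the sample values so the snippet runs on its own; the variable names `items_100`, `first_items` and `order_totals` are made up for illustration.

```python
import pandas as pd

# same shape as df_mi above, built directly from the sample values
df_mi = pd.DataFrame({
    "order_id": [100, 100, 101, 101],
    "order_item_id": [0, 1, 0, 1],
    "Product": ["Pen", "Notebook", "Bag", "Watch"],
    "Price_Product": [1.5, 3.0, 10.0, 12.0],
}).set_index(["order_id", "order_item_id"])

# all items of one order
items_100 = df_mi.loc[100]

# the first item of every order (cross-section on the inner level)
first_items = df_mi.xs(0, level="order_item_id")

# total price per order
order_totals = df_mi.groupby(level="order_id")["Price_Product"].sum()
print(order_totals)
```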
Or, combining my solution with JANO's:
order_prod_ids = [col[-1] for col in df.columns if col[:-1] == 'Product_']
# build one frame per product suffix, then concatenate them
frames = []
for opid in order_prod_ids:
    # .copy() avoids a SettingWithCopyWarning when adding the new column
    df_opid = df[['order_id', 'Product_'+opid, 'Price_Product_'+opid]].copy()
    df_opid.columns = ['order_id', 'product', 'price']
    df_opid['order_prod_id'] = opid
    frames.append(df_opid)
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df_mi = pd.concat(frames)
df_mi = df_mi.sort_values(['order_id','order_prod_id']).set_index(['order_id','order_prod_id'])
There is a convenient function for this, wide_to_long:
pd.wide_to_long(df, ['Product','Price_Product'], i='order_id', j='subtype', sep='_', suffix=r'\D+')
Output:
Product Price_Product
order_id subtype
100 A Pen 1.5
101 A Bag 10.0
100 B Notebook 3.0
101 B Watch 12.0
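To get from that result to the exact layout asked for in the question, a few more steps are enough. The following is a sketch on the sample data; the names `long` and `tidy` are only illustrative.

```python
import pandas as pd

# sample data copied from the question
df = pd.DataFrame({
    "order_id": [100, 101],
    "Product_A": ["Pen", "Bag"],
    "Product_B": ["Notebook", "Watch"],
    "Price_Product_A": [1.5, 10],
    "Price_Product_B": [3, 12],
})

long = pd.wide_to_long(
    df,
    stubnames=["Product", "Price_Product"],
    i="order_id",
    j="subtype",
    sep="_",
    suffix=r"\D+",  # raw string: the suffix is one or more non-digit characters
)

tidy = (
    long.reset_index()
    .rename(columns={"Product": "product", "Price_Product": "price"})
    .sort_values(["order_id", "subtype"])
    .reset_index(drop=True)
)
tidy = tidy[["order_id", "product", "price"]]
print(tidy)
```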
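The same result can be achieved with melt and unstack, which is somewhat instructive. The slightly tricky part is splitting 'variable' into two parts, the root and the suffix, which is the step wide_to_long handles for you. For your example, it could look like this: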
df1 = df.melt(id_vars = 'order_id')
df1['cat'] = df1['variable'].str[:-2] # you may have to tweak this for your actual data
df1['subtype'] = df1['variable'].str[-1:] # you may have to tweak this for your actual data
(df1.drop(columns = 'variable')
.set_index(['order_id','subtype','cat'])
.unstack()
.droplevel(level=0, axis=1)
.reset_index()
)
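Since the question asks about stack() specifically, here is one possible sketch of that route. It assumes every column name ends in an underscore followed by the product suffix, so that `rsplit("_", 1)` separates the root from the suffix; the level names `measure` and `subtype` and the variable names are made up for illustration. Recent pandas 2.x releases may emit a FutureWarning about the stack implementation here.

```python
import pandas as pd

# sample data copied from the question
df = pd.DataFrame({
    "order_id": [100, 101],
    "Product_A": ["Pen", "Bag"],
    "Product_B": ["Notebook", "Watch"],
    "Price_Product_A": [1.5, 10],
    "Price_Product_B": [3, 12],
})

wide = df.set_index("order_id")
# "Product_A" -> ("Product", "A"); "Price_Product_B" -> ("Price_Product", "B")
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(col.rsplit("_", 1)) for col in wide.columns],
    names=["measure", "subtype"],
)

# move the suffix level from the columns into the row index
long = wide.stack(level="subtype")

tidy = (
    long.reset_index()
    .rename_axis(columns=None)  # drop the leftover "measure" label on the columns
    .rename(columns={"Product": "product", "Price_Product": "price"})
    .sort_values(["order_id", "subtype"])
    .reset_index(drop=True)
)
tidy = tidy[["order_id", "product", "price"]]
print(tidy)
```

The key step is turning the flat column names into a two-level column index; once the suffix is its own level, stack() can pivot it into rows while keeping the two measures as separate columns.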