Grouping several dataframe columns based on another column's values

I have this dataframe:

    refid   col2    price1  factor1 price2  factor2 price3  factor3
  0   1       a      200       1      180     3     150       10
  1   2       b      500       1      450     3     400       10
  2   3       c      700       1      620     2     550        5

I need to get this output:

   refid    col2    price   factor
0   1        a      200       1
1   1        b      500       1
2   1        c      700       1
3   2        a      180       3
4   2        b      450       3
5   2        c      620       2
6   3        a      150       10
7   3        b      400       10
8   3        c      550       5

Right now I'm trying to use the df.melt method, but I can't get it to work. Here is the code and the current result:

df2_melt = df2.melt(id_vars=["refid","col2"],
        value_vars=["price1","price2","price3",
                   "factor1","factor2","factor3"],
        var_name="Price", 
        value_name="factor")



    refid   col2    price   factor
0       1   a      price1   200
1       2   b      price1   500
2       3   c      price1   700
3       1   a      price2   180
4       2   b      price2   450
5       3   c      price2   620
6       1   a      price3   150
7       2   b      price3   400
8       3   c      price3   550
9       1   a      factor1  1
10      2   b      factor1  1
11      3   c      factor1  1
12      1   a      factor2  3
13      2   b      factor2  3
14      3   c      factor2  2
15      1   a      factor3  10
16      2   b      factor3  10
17      3   c      factor3  5

You can melt twice (once for the price columns, once for the factor columns) and then concatenate the results:

import pandas as pd  

df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3,3,2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})
prices = [c for c in df if c.startswith('price')]
factors = [c for c in df if c.startswith('factor')]
df1 = pd.melt(df, id_vars=["refid","col2"], value_vars=prices, value_name='price').drop('variable', axis=1)
df2 = pd.melt(df, id_vars=["refid","col2"], value_vars=factors, value_name='factor').drop('variable', axis=1)
df3 = pd.concat([df1, df2['factor']], axis=1).reset_index(drop=True)
print(df3)

Here is the output:

     refid  col2  price  factor
0      1    a    200       1
1      2    b    500       1
2      3    c    700       1
3      1    a    180       3
4      2    b    450       3
5      3    c    620       2
6      1    a    150      10
7      2    b    400      10
8      3    c    550       5
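The concatenation above relies on both melts returning rows in the same order. As a minimal sketch of a more defensive variant (the helper name `melt_stub` and the `num` column are illustrative assumptions, not part of the answer above), you could keep the numeric suffix from each melt and merge on it, so that prices and factors are paired by key rather than by position:

```python
import pandas as pd

df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3, 3, 2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})

ids = ['refid', 'col2']

def melt_stub(frame, stub):
    # Hypothetical helper: melt every column starting with `stub`
    # and keep the numeric suffix in a 'num' column.
    cols = [c for c in frame.columns if c.startswith(stub)]
    out = frame.melt(id_vars=ids, value_vars=cols,
                     var_name='num', value_name=stub)
    out['num'] = out['num'].str[len(stub):].astype(int)
    return out

# Pair each price with its factor on (refid, col2, num) instead of row position.
merged = melt_stub(df, 'price').merge(melt_stub(df, 'factor'), on=ids + ['num'])
print(merged)
```

If the suffix is not needed afterwards, `merged.drop(columns='num')` should give the same four columns as the concatenated result.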

Since you have a wide DataFrame whose columns share common prefixes, you can use wide_to_long:

out = pd.wide_to_long(df, stubnames=['price','factor'], 
                      i=["refid","col2"], j='num').droplevel(-1).reset_index()

Output:

   refid col2  price  factor
0      1    a    200       1
1      1    a    180       3
2      1    a    150      10
3      2    b    500       1
4      2    b    450       3
5      2    b    400      10
6      3    c    700       1
7      3    c    620       2
8      3    c    550       5

Note that your expected output contains an error: the factors do not line up with the refids.
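If the row order of the expected output matters (all the `1` columns first, then the `2` columns, and so on), one possibility is to keep the suffix level instead of dropping it and sort on it. This is only a sketch under that assumption; the names `num` and `long_df` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'refid': [1, 2, 3], 'col2': ['a', 'b', 'c'],
                   'price1': [200, 500, 700], 'factor1': [1, 1, 1],
                   'price2': [180, 450, 620], 'factor2': [3, 3, 2],
                   'price3': [150, 400, 550], 'factor3': [10, 10, 5]})

long_df = (
    pd.wide_to_long(df, stubnames=['price', 'factor'],
                    i=['refid', 'col2'], j='num')
      .reset_index()                  # keep 'num' (the column suffix) as a column
      .sort_values(['num', 'refid'])  # suffix-major order, then refid
      .reset_index(drop=True)
)
print(long_df)
```

Here `num` records which price/factor pair each row came from, which may be what the first column of the expected output was meant to hold.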

One option is pivot_longer from pyjanitor:

# pip install pyjanitor
import janitor
import pandas as pd

(df
.pivot_longer(
    index = ['refid', 'col2'], 
    names_to = '.value', 
    names_pattern = r"(.+)\d", 
    sort_by_appearance = True)
)

   refid col2  price  factor
0      1    a    200       1
1      1    a    180       3
2      1    a    150      10
3      2    b    500       1
4      2    b    450       3
5      2    b    400      10
6      3    c    700       1
7      3    c    620       2
8      3    c    550       5

The idea for this particular reshape is that whatever group in the regular expression is paired with .value stays as a column header.
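To see which part of each column name that group captures, here is a small sketch using only Python's `re` module. It illustrates the regular expression itself and makes no claim about how pyjanitor applies it internally:

```python
import re

columns = ['price1', 'factor1', 'price2', 'factor2', 'price3', 'factor3']
pattern = re.compile(r"(.+)\d")

headers = []
for name in columns:
    match = pattern.search(name)
    header = match.group(1)  # the captured group: the text before the final digit
    if header not in headers:
        headers.append(header)

print(headers)  # ['price', 'factor']
```

Because `.+` is greedy, a two-digit suffix such as `price10` would be captured as `price1` by this pattern, so a stricter pattern may be worth considering if your columns go past 9.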