Is there no syntactic sugar for dynamically creating columns in a MultiIndex pandas DataFrame?

Question

First, here is a pandas DataFrame to illustrate my question.

import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)

This Python code creates the DataFrame (df1) shown below:

#input dataframe
lv1  A       B
lv2  c   d   c   d
0    1   2   3   4
1    5   6   7   8
2    9  10  11  12

I want to use the data in df1 to create a 'c*d' column at level lv2, like this:

#output dataframe after calculation
lv1  A            B
lv2  c   d  c*d   c   d  c*d
0    1   2    2   3   4   12
1    5   6   30   7   8   56
2    9  10   90  11  12  132

To solve this, I wrote the following code:

for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
df1.sort_index(axis=1, inplace=True)

This code nearly solves my problem, but I would really like to write it without the 'for' statement, like this:

df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]

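This statement raises a KeyError saying that 'c*d' is missing. Is there no syntactic sugar for this calculation? Or is there other code that would give better performance?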

Answer 1

A slightly improved version of your solution:

for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")]
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c*d']])
df1 = df1.reindex(columns=mux)
print (df1)
   A           B         
   c   d c*d   c   d  c*d
0  1   2   2   3   4   12
1  5   6  30   7   8   56
2  9  10  90  11  12  132

Another solution with stack and unstack, starting again from the original df1:

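mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c_d']])
# Parentheses let the method chain span several lines
df1 = (df1.stack(0)
          # prod multiplies c and d row-wise (sum would add them instead)
          .assign(c_d = lambda x: x.prod(axis=1))
          .unstack()
          .swaplevel(0, 1, axis=1)
          .reindex(columns=mux))
print (df1)
   A           B         
   c   d c_d   c   d  c_d
0  1   2   2   3   4   12
1  5   6  30   7   8   56
2  9  10  90  11  12  132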

A third solution takes the c and d cross-sections with xs and multiplies them:

df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
    A    B
  c*d  c*d
0   2   12
1  30   56
2  90  132
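
A note on why xs is used here rather than the .loc slices from the question: pandas aligns DataFrame arithmetic on column labels, and .loc keeps both column levels. The sketch below rebuilds the question's df1 and compares the two selections; the names left, right, misaligned and aligned are only illustrative.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["A", "B"], ["c", "d"]], names=["lv1", "lv2"])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

# .loc keeps both column levels: (A, c), (B, c) versus (A, d), (B, d)
left = df1.loc[:, (slice(None), "c")]
right = df1.loc[:, (slice(None), "d")]
# The two frames share no column label, so alignment fills every cell with NaN
misaligned = left * right

# xs drops lv2, leaving plain 'A' and 'B' labels that line up
aligned = df1.xs("c", axis=1, level=1) * df1.xs("d", axis=1, level=1)
print(aligned)
```

So even if the one-line assignment in the question were accepted, its right-hand side would not hold the products; dropping lv2 first is what makes the multiplication line up.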

mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
   A           B         
   c   d c*d   c   d  c*d
0  1   2   2   3   4   12
1  5   6  30   7   8   56
2  9  10  90  11  12  132

Answer 2

This explains jezrael's stack-based answer, which is probably the most idiomatic way in pandas.

output = (df1
             # "Stack" data, by moving the top level ('lv1') of the
             # column MultiIndex into row index,
             # now the rows are a MultiIndex and the columns
             # are a regular Index.
             .stack(0)

             # Since we only have 2 columns now, 'lv2' ('c' & 'd')
             # we can multiply them together along the row axis.
             # The assign method takes key=value pairs mapping new column
             # names to the function used to calculate them. Here we're
             # wrapping them in a dictionary and unpacking them using **
             .assign(**{'c*d': lambda x: x.product(axis=1)})

             # Undoes the stack operation, moving 'lv1' back to the
             # column index, but now as the bottom level of the column index
             .unstack()

             # This sets the order of the column index MultiIndex levels.
             # Since they are named we can use the names, you can also use
             # their integer positions instead. Here axis=1 references
             # the column index
             .swaplevel('lv1', 'lv2', axis=1)

             # Sort the values in both levels of the column MultiIndex.
             # This will order them as c, c*d, d which is not what you
             # specified above, however having a sorted MultiIndex is required
             # for indexing via .loc[:, (...)] to work properly
             .sort_index(axis=1)
          )
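
As a usage sketch, assuming the same df1 as in the question, the sorted result can then be sliced across lv1 in one .loc call. The variable name products is only illustrative, and recent pandas versions may emit a FutureWarning from stack.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["A", "B"], ["c", "d"]], names=["lv1", "lv2"])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

# Same chain as above, without the explanatory comments
output = (
    df1.stack(0)
    .assign(**{"c*d": lambda x: x.prod(axis=1)})
    .unstack()
    .swaplevel("lv1", "lv2", axis=1)
    .sort_index(axis=1)
)

# Select the 'c*d' column under every lv1 value in one slice
products = output.loc[:, (slice(None), "c*d")]
print(products)
```

pd.IndexSlice offers a shorter spelling of the same selection: output.loc[:, pd.IndexSlice[:, "c*d"]].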
