Using `itertools` to combine DataFrame columns

I have a DataFrame that looks like this toy example:

import numpy as np
import pandas as pd
df = np.array([
    [20.078,19.679,19.585,19.406,19.37,14.97,13.992,20.122,20.736],
    [20.443,19.115,18.918,18.749,18.698,14.638,14.041,21.646,21.456],
    [19.723,19.593,19.353,19.175,19.258,15.193,14.354,21.122,21.09],
    [19.683,19.393,19.273,18.995,18.95,15.545,14.53,22.465,20.091],
    [19.769,19.233,19.083,18.983,18.768,14.978,14.224,21.684,20.314],
    [19.908,19.5,19.065,18.838,18.354,13.837,13.016,21.307,21.234]
])

df = pd.DataFrame(df, columns = ['u', 'g', 'r', 'i', 'zmag', 'W1', 'W2', 'NUV', 'FUV'])

I want to combine the columns pairwise by subtraction, which is what this snippet does:

df['FUV_NUV'] = df['FUV'] - df['NUV']
df['FUV_u'] = df['FUV'] - df['u']
df['u_g'] = df['u'] - df['g']
df['g_r'] = df['g'] - df['r']
df['r_i'] = df['r'] - df['i']
df['i_z'] = df['i'] - df['zmag']
df['z_W1'] = df['zmag'] - df['W1']
df['W1_W2'] = df['W1'] - df['W2']

Of course, I found a better way of doing this here:

from itertools import chain, combinations

combs = list(chain.from_iterable(combinations(df.columns, i)
                                 for i in range(2, len(df.columns) + 1)))
for cols in combs:
    df['_'.join(cols)] = df.loc[:, cols].sum(axis=1)

However, this produces all combinations (e.g. u+g+W1+W2+...).

How can I change it to

  1. iterate over the columns and generate all combinations of at most two columns (e.g. u-g, u-r, u-i, u-zmag, ...)
  2. subtract (i.e. not sum)?

Try df.diff(axis=1) and have a look at the function's documentation:

Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).

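With axis=1, this function takes the difference between each column and the column before it (for example g - u). You may therefore have to reorder the columns for it to give the combinations you want.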
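As a rough sketch of what that looks like, here is diff(axis=1) on a cut-down stand-in for the question's data (two rows, three columns). The name `left_minus_right` is only for illustration. Note the sign: diff gives each column minus its left neighbour, so the result is negated here to match the u - g convention used in the question.

```python
import pandas as pd

# Small stand-in for the question's data (two rows, three columns).
df = pd.DataFrame({
    "u": [20.078, 20.443],
    "g": [19.679, 19.115],
    "r": [19.585, 18.918],
})

# diff(axis=1) computes each column minus the column to its left,
# so the first column has nothing to subtract from it and is all NaN.
diffs = df.diff(axis=1)

# Drop the all-NaN first column and negate to get left-minus-right (u - g, g - r).
left_minus_right = -diffs.iloc[:, 1:]
left_minus_right.columns = [
    f"{a}_{b}" for a, b in zip(df.columns[:-1], df.columns[1:])
]
```

This only covers neighbouring columns in their current order, which is why reordering may be needed first.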


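Here is another option that uses a loop.

In my opinion, the itertools approach is less readable than simply writing out the columns you want to subtract.

I would also recommend using the df.loc[] syntax when setting data on a DataFrame.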

pairs = [
    ("FUV", "NUV"),
    ("FUV", "u"),
    ("u", "g"),
    ("g", "r"),
    ("r", "i"),
    ("i", "zmag"),
    ("zmag", "W1"),
    ("W1", "W2"),
]

for col1, col2 in pairs:
    new_col_name = f"{col1}_{col2}"
    df.loc[:, new_col_name] = df[col1] - df[col2]
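For comparison, if you really do want every pair rather than a hand-picked list, a minimal itertools sketch could look like the following. It uses a cut-down stand-in for the question's data, and `base_cols` and `pair_diffs` are just illustrative names.

```python
from itertools import combinations

import pandas as pd

# Small stand-in for the question's data (two rows, three columns).
df = pd.DataFrame({
    "u": [20.078, 20.443],
    "g": [19.679, 19.115],
    "r": [19.585, 18.918],
})

# Freeze the column list first so the columns added below are not paired again.
base_cols = list(df.columns)

# combinations(..., 2) yields every unordered pair once, in column order:
# (u, g), (u, r), (g, r)
pair_diffs = {f"{a}_{b}": df[a] - df[b] for a, b in combinations(base_cols, 2)}

# Attach all new columns in one step.
df = pd.concat([df, pd.DataFrame(pair_diffs)], axis=1)
```

With the nine columns in the question this would produce 36 new columns, so the explicit list above may still be the better choice if only a few of them are meaningful.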

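You can use this list comprehension to generate a list of 2-element combinations of adjacent columns, including the wraparound from the last column back to the first. You can then generate the extra columns by iterating over this list.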

comb = [(x, df.columns[(i+1) % len(df.columns)]) for i, x in enumerate(df.columns)]


for x, y in comb:
    df[f'{x}_{y}'] = df[x] - df[y]

This produces the output:

        u       g       r       i    zmag      W1      W2     NUV     FUV    u_g    g_r    r_i  i_zmag  zmag_W1  W1_W2  W2_NUV  NUV_FUV  FUV_u
0  20.078  19.679  19.585  19.406  19.370  14.970  13.992  20.122  20.736  0.399  0.094  0.179   0.036    4.400  0.978  -6.130   -0.614  0.658
1  20.443  19.115  18.918  18.749  18.698  14.638  14.041  21.646  21.456  1.328  0.197  0.169   0.051    4.060  0.597  -7.605    0.190  1.013
2  19.723  19.593  19.353  19.175  19.258  15.193  14.354  21.122  21.090  0.130  0.240  0.178  -0.083    4.065  0.839  -6.768    0.032  1.367
3  19.683  19.393  19.273  18.995  18.950  15.545  14.530  22.465  20.091  0.290  0.120  0.278   0.045    3.405  1.015  -7.935    2.374  0.408
4  19.769  19.233  19.083  18.983  18.768  14.978  14.224  21.684  20.314  0.536  0.150  0.100   0.215    3.790  0.754  -7.460    1.370  0.545
5  19.908  19.500  19.065  18.838  18.354  13.837  13.016  21.307  21.234  0.408  0.435  0.227   0.484    4.517  0.821  -8.291    0.073  1.326
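If the wraparound pair (FUV_u in the output above) is not wanted, one possible variation is to pair each column with its right-hand neighbour using zip instead of the modulo index. This is only a sketch on a one-row stand-in for the data, and `adjacent` is an illustrative name.

```python
import pandas as pd

# One-row stand-in for the question's data.
df = pd.DataFrame({"u": [20.078], "g": [19.679], "r": [19.585]})

cols = list(df.columns)

# zip pairs each column with its right-hand neighbour and stops at the
# last column, so no (last, first) pair is produced.
adjacent = list(zip(cols, cols[1:]))

for x, y in adjacent:
    df[f"{x}_{y}"] = df[x] - df[y]
```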