Using `itertools` to combine DataFrame columns
I have a DataFrame that looks like this toy example:
import numpy as np
import pandas as pd
df = np.array([
[20.078,19.679,19.585,19.406,19.37,14.97,13.992,20.122,20.736],
[20.443,19.115,18.918,18.749,18.698,14.638,14.041,21.646,21.456],
[19.723,19.593,19.353,19.175,19.258,15.193,14.354,21.122,21.09],
[19.683,19.393,19.273,18.995,18.95,15.545,14.53,22.465,20.091],
[19.769,19.233,19.083,18.983,18.768,14.978,14.224,21.684,20.314],
[19.908,19.5,19.065,18.838,18.354,13.837,13.016,21.307,21.234]
])
df = pd.DataFrame(df, columns = ['u', 'g', 'r', 'i', 'zmag', 'W1', 'W2', 'NUV', 'FUV'])
I want to combine the columns pairwise by subtraction, which is what this snippet does:
df['FUV_NUV'] = df['FUV'] - df['NUV']
df['FUV_u'] = df['FUV'] - df['u']
df['u_g'] = df['u'] - df['g']
df['g_r'] = df['g'] - df['r']
df['r_i'] = df['r'] - df['i']
df['i_z'] = df['i'] - df['zmag']
df['z_W1'] = df['zmag'] - df['W1']
df['W1_W2'] = df['W1'] - df['W2']
Of course, I found a better approach here:
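from itertools import chain, combinations
# every combination of 2 or more columns
combs = list(chain.from_iterable(combinations(df.columns, i)
                                 for i in range(2, len(df.columns) + 1)))
for cols in combs:
    # list(cols): a tuple would be read as a single (MultiIndex) key
    df['_'.join(cols)] = df.loc[:, list(cols)].sum(axis=1)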
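However, this produces all combinations (e.g. u+g+W1+W2+...).
How can I change it to
- iterate over the columns to generate all combinations of at most two columns (e.g. u-g, u-r, u-i, u-zmag, ...)
- subtract (i.e. not sum)?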
Try df.diff(axis=1) and take a look at the function's documentation.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).
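With axis=1, this function takes the difference between each column and the column before it. So you may have to reorder the columns for it to give the pairs you want.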
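As a rough sketch of the diff suggestion: the default diff(axis=1) computes each column minus the one before it, which is the opposite sign of the u - g style differences in the question. Passing periods=-1 subtracts the next column instead. The three-column frame and the names diffs and result below are only stand-ins for illustration, not part of the original post.

```python
import pandas as pd

# Small stand-in frame; the column names and values are taken from the question.
df = pd.DataFrame(
    {"u": [20.078, 20.443], "g": [19.679, 19.115], "r": [19.585, 18.918]}
)

# periods=-1 subtracts the *next* column from each column (u - g, g - r, ...).
# The last column has no right-hand neighbour and comes back as all-NaN,
# so it is dropped with iloc.
diffs = df.diff(periods=-1, axis=1).iloc[:, :-1]

# Rename each result to "<left>_<right>" to mirror the hand-written names.
diffs.columns = [f"{a}_{b}" for a, b in zip(df.columns[:-1], df.columns[1:])]

# Attach the differences next to the original columns.
result = df.join(diffs)
print(result)
```

This only covers adjacent columns, so pairs such as FUV - NUV or FUV - u from the question would still need the columns reordered first.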
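Here is another option that uses a loop.
In my opinion, the itertools approach is less readable than simply writing out the columns to subtract.
I would also recommend using the df.loc[] syntax when setting data on a DataFrame.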
pairs = [
("FUV", "NUV"),
("FUV", "u"),
("u", "g"),
("g", "r"),
("r", "i"),
("i", "zmag"),
("zmag", "W1"),
("W1", "W2"),
]
for col1, col2 in pairs:
    new_col_name = f"{col1}_{col2}"
    df.loc[:, new_col_name] = df[col1] - df[col2]
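You can use this list comprehension to generate a list of 2-element combinations of adjacent columns, including the wraparound from the last column back to the first.
You can then generate the extra columns by iterating over this list.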
comb = [(x, df.columns[(i+1) % len(df.columns)]) for i, x in enumerate(df.columns)]
for x, y in comb:
    df[f'{x}_{y}'] = df[x] - df[y]
This produces the output:
u g r i zmag W1 W2 NUV FUV u_g g_r r_i i_zmag zmag_W1 W1_W2 W2_NUV NUV_FUV FUV_u
0 20.078 19.679 19.585 19.406 19.370 14.970 13.992 20.122 20.736 0.399 0.094 0.179 0.036 4.400 0.978 -6.130 -0.614 0.658
1 20.443 19.115 18.918 18.749 18.698 14.638 14.041 21.646 21.456 1.328 0.197 0.169 0.051 4.060 0.597 -7.605 0.190 1.013
2 19.723 19.593 19.353 19.175 19.258 15.193 14.354 21.122 21.090 0.130 0.240 0.178 -0.083 4.065 0.839 -6.768 0.032 1.367
3 19.683 19.393 19.273 18.995 18.950 15.545 14.530 22.465 20.091 0.290 0.120 0.278 0.045 3.405 1.015 -7.935 2.374 0.408
4 19.769 19.233 19.083 18.983 18.768 14.978 14.224 21.684 20.314 0.536 0.150 0.100 0.215 3.790 0.754 -7.460 1.370 0.545
5 19.908 19.500 19.065 18.838 18.354 13.837 13.016 21.307 21.234 0.408 0.435 0.227 0.484 4.517 0.821 -8.291 0.073 1.326
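For the literal request in the question (every pair of columns, subtracted rather than summed), a minimal sketch using itertools.combinations with a fixed size of 2 might look like the following. The three-column frame and the names base_cols and new_cols are assumptions made for illustration; the idea is just to replace the range of sizes with 2 and sum with a subtraction.

```python
from itertools import combinations

import pandas as pd

# Small stand-in frame; the column names and values are taken from the question.
df = pd.DataFrame(
    {"u": [20.078, 20.443], "g": [19.679, 19.115], "r": [19.585, 18.918]}
)

# Freeze the original column list first, so newly added columns are not paired again.
base_cols = list(df.columns)

# combinations(..., 2) yields each unordered pair exactly once, in column order:
# (u, g), (u, r), (g, r)
new_cols = {f"{a}_{b}": df[a] - df[b] for a, b in combinations(base_cols, 2)}

# Attach all difference columns in one step.
df = df.assign(**new_cols)
print(df.columns.tolist())
```

With the nine columns of the original frame this would add 36 difference columns. Each pair appears in one direction only (u - g but not g - u), following the column order.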