How do I efficiently multiply DataFrame columns pairwise?

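I want to do feature engineering with several numeric features. The idea is to multiply the columns of a DataFrame pairwise. I would prefer something available in a machine learning library such as TensorFlow, Keras, TPOT, or H2O (I don't know the technical name for this process), but a solution without a library is fine too.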

Here is my simplified dataset:

No  feature_1  feature_2  feature_3
1          10         20         30
2          20         30         40 

Here is what I need:

No  feature_1  feature_2  feature_3  feature_1xfeature_2  feature_1xfeature_3  feature_2xfeature_3
1          10         20         30                  200                  300                  600
2          20         30         40                  600                  800                 1200

What I have done so far:

df['feature_1xfeature_2'] = df['feature_1'] * df['feature_2']
df['feature_1xfeature_3'] = df['feature_1'] * df['feature_3']
df['feature_2xfeature_3'] = df['feature_2'] * df['feature_3']

This is error-prone with a large number of features. How can I automate it?

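You can use itertools to get the product of every pair of columns:

import itertools

# repeat=2 yields every ordered pair of columns, including (a, a) and both (a, b) and (b, a)
for col_a, col_b in itertools.product(df.columns, repeat=2):
    df[col_a + 'x' + col_b] = df[col_a] * df[col_b]

itertools.product(df.columns, repeat=2) generates every ordered pair of columns drawn from df.columns, including each column paired with itself.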

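Edit

Looking at your question more closely, I think you are better off using itertools.combinations. It does not produce every ordered pair, only each unordered pair of distinct columns, once.

For example, assume the columns 'A', 'B', 'C'.

itertools.product yields ('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'B'), ('C', 'C').

itertools.combinations yields ('A', 'B'), ('A', 'C'), ('B', 'C').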
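The difference is easy to check in isolation. A minimal sketch using a plain list of the three column names from the example (the variable names are chosen here for illustration):

```python
import itertools

cols = ["A", "B", "C"]

# Ordered pairs with repetition: 3 * 3 = 9 tuples,
# including ('A', 'A') and both ('A', 'B') and ('B', 'A').
ordered_pairs = list(itertools.product(cols, repeat=2))

# Unordered pairs of distinct items: C(3, 2) = 3 tuples.
unordered_pairs = list(itertools.combinations(cols, 2))

print(len(ordered_pairs), len(unordered_pairs))
print(unordered_pairs)
```

Because multiplication is commutative, the six extra tuples from product would only add squared columns and duplicates of the three products you actually want.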

So this is the better option:

import itertools

for col_a, col_b in itertools.combinations(df.columns, 2):
    df[col_a + 'x' + col_b] = df[col_a] * df[col_b]
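One caveat: looping over df.columns would also multiply the No column from your sample data into every pair. Below is a minimal sketch that avoids this, assuming the feature columns are listed explicitly (feature_cols and df_out are names chosen here, not from the question). It also builds all products first and attaches them in one pd.concat, which may scale better than inserting columns one by one when there are many features:

```python
import itertools

import pandas as pd

df = pd.DataFrame(
    [[1, 10, 20, 30], [2, 20, 30, 40]],
    columns=["No", "feature_1", "feature_2", "feature_3"],
)

# Assumption: only these columns take part; "No" is a row identifier.
feature_cols = ["feature_1", "feature_2", "feature_3"]

# Compute every pairwise product, keyed by its new column name.
interactions = {
    f"{col_a}x{col_b}": df[col_a] * df[col_b]
    for col_a, col_b in itertools.combinations(feature_cols, 2)
}

# Attach all new columns at once.
df_out = pd.concat([df, pd.DataFrame(interactions)], axis=1)
print(df_out)
```

For the sample data this gives the three new columns feature_1xfeature_2, feature_1xfeature_3 and feature_2xfeature_3 with the values 200, 300, 600 in the first row and 600, 800, 1200 in the second.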

There are also more specialized ways to automate this, for example scikit-learn's PolynomialFeatures:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# original data
df = pd.DataFrame(
    data=[[1, 10, 20, 30], [2, 20, 30, 40]],
    columns=['No', 'feature_1', 'feature_2', 'feature_3'],
)

# select the columns to generate features from
seed_feature_names = ['feature_1', 'feature_2', 'feature_3']
seed_features = df[seed_feature_names]

# generate the original features plus all pairwise interaction terms
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
df_enhanced = pd.DataFrame(
    data=poly.fit_transform(seed_features),
    # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
    columns=poly.get_feature_names_out(seed_feature_names),
)