How do I efficiently compute pairwise multiplication across DataFrame columns?
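I want to do feature engineering using several numeric features. The idea is to do pairwise multiplication across the dataframe. I would prefer something available in a machine learning library such as TensorFlow, Keras, TPOT, H2O, etc. (I don't know the scientific name for this process), but a solution without a library is fine too.
Here is my simplified dataset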
No feature_1 feature_2 feature_3
1 10 20 30
2 20 30 40
This is what I need
No feature_1 feature_2 feature_3 feature_1xfeature_2 feature_1xfeature_3 feature_2xfeature_3
1 10 20 30 200 300 600
2 20 30 40 600 800 1200
What I have done
df['feature_1xfeature2'] = df['feature_1'] * df['feature_2']
df['feature_1xfeature3'] = df['feature_1'] * df['feature_3']
df['feature_2xfeature3'] = df['feature_2'] * df['feature_3']
This is error-prone with a large number of features. How can I automate it?
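You can use itertools to get the products of all columns:
import itertools
for col_a, col_b in itertools.product(df.columns, repeat=2):
    df[col_a + 'x' + col_b] = df[col_a] * df[col_b]
itertools.product(df.columns, repeat=2) generates every ordered pair of columns from df.columns, including each column paired with itself.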
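Edit
Looking at your question in more detail, I think you would be better off using itertools.combinations. It does not generate every possible product, only every unordered combination of two distinct columns.
For example, given columns 'A', 'B', 'C':
itertools.product yields ('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'B'), ('C', 'C').
itertools.combinations yields ('A', 'B'), ('A', 'C'), ('B', 'C').
So this would be better: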
import itertools
for col_a, col_b in itertools.combinations(df.columns, 2):
    df[col_a + 'x' + col_b] = df[col_a] * df[col_b]
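Note that the loop above iterates over every column in df, so with the dataset from the question it would also multiply the No column. Below is a minimal sketch that restricts the products to a chosen list of feature columns and builds them in one step. The helper name add_pairwise_products and its parameters are hypothetical and not part of pandas.

```python
import itertools

import pandas as pd


def add_pairwise_products(df, feature_cols, sep="x"):
    """Return a copy of df with one product column per unordered pair of feature_cols."""
    products = {
        f"{a}{sep}{b}": df[a] * df[b]
        for a, b in itertools.combinations(feature_cols, 2)
    }
    # Concatenate once rather than assigning column by column,
    # which avoids repeated column inserts when there are many features.
    return pd.concat([df, pd.DataFrame(products, index=df.index)], axis=1)


df = pd.DataFrame(
    [[1, 10, 20, 30], [2, 20, 30, 40]],
    columns=["No", "feature_1", "feature_2", "feature_3"],
)
result = add_pairwise_products(df, ["feature_1", "feature_2", "feature_3"])
print(result)
```

Because the helper returns a new frame, the original df is left unchanged, which can make it easier to rerun with a different column list.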
There are other, more specialized ways to do this automatically, for example PolynomialFeatures from scikit-learn:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
# original data
df = pd.DataFrame(data=[[1, 10, 20, 30], [2, 20, 30, 40]], columns=['No', 'feature_1', 'feature_2', 'feature_3'])
# select the features to use for feature generation
seed_feature_names = ['feature_1', 'feature_2', 'feature_3']
seed_features = df[seed_feature_names]
# generate the interaction features (pairwise products only, no squared terms, no bias column)
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
df_enhanced = pd.DataFrame(data=poly.fit_transform(seed_features), columns=poly.get_feature_names_out(seed_features.columns))