Sklearn preprocessing - PolynomialFeatures - How to keep column names/headers of the output array / dataframe
TLDR: How do I get headers for the output numpy array from the sklearn.preprocessing.PolynomialFeatures() function?
Suppose I have the following code...
import pandas as pd
import numpy as np
from sklearn import preprocessing as pp
a = np.ones(3)
b = np.ones(3) * 2
c = np.ones(3) * 3
input_df = pd.DataFrame([a,b,c])
input_df = input_df.T
input_df.columns=['a', 'b', 'c']
input_df
a b c
0 1 2 3
1 1 2 3
2 1 2 3
poly = pp.PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)
print(output_nparray)
[[ 1. 1. 2. 3. 1. 2. 3. 4. 6. 9.]
[ 1. 1. 2. 3. 1. 2. 3. 4. 6. 9.]
[ 1. 1. 2. 3. 1. 2. 3. 4. 6. 9.]]
How can I get the 3x10 matrix / output_nparray to carry over the a, b, c labels, showing how they relate to the data above?
This works:
def PolynomialFeatures_labeled(input_df, power):
    '''Basically this is a cover for the sklearn preprocessing function.
    The problem with that function is that if you give it a labeled dataframe, it outputs
    an unlabeled array with potentially a whole bunch of unlabeled columns.

    Inputs:
    input_df = Your labeled pandas dataframe (list of x's not raised to any power)
    power = what order polynomial you want variables up to (use the same power as you
            would pass to pp.PolynomialFeatures(power) directly)

    Output:
    This function relies on the powers_ matrix, which is one of the preprocessing
    function's outputs, to create logical labels, and returns a labeled pandas dataframe.
    '''
    poly = pp.PolynomialFeatures(power)
    output_nparray = poly.fit_transform(input_df)
    powers_nparray = poly.powers_

    input_feature_names = list(input_df.columns)
    # The first row of powers_ is all zeros (the bias column), so label it by hand.
    target_feature_names = ["Constant Term"]
    for feature_distillation in powers_nparray[1:]:
        intermediary_label = ""
        final_label = ""
        for i in range(len(input_feature_names)):
            if feature_distillation[i] == 0:
                continue
            else:
                variable = input_feature_names[i]
                # Named "exponent" so it does not shadow the "power" argument.
                exponent = feature_distillation[i]
                intermediary_label = "%s^%d" % (variable, exponent)
                if final_label == "":  # If the final label isn't yet specified
                    final_label = intermediary_label
                else:
                    final_label = final_label + " x " + intermediary_label
        target_feature_names.append(final_label)
    output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
    return output_df
output_df = PolynomialFeatures_labeled(input_df,2)
output_df
Constant Term a^1 b^1 c^1 a^2 a^1 x b^1 a^1 x c^1 b^2 b^1 x c^1 c^2
0 1 1 2 3 1 2 3 4 6 9
1 1 1 2 3 1 2 3 4 6 9
2 1 1 2 3 1 2 3 4 6 9
A working example, all in one line (I assume "readability" is not the goal here):
target_feature_names = ['x'.join(['{}^{}'.format(pair[0], pair[1]) for pair in pairs if pair[1] != 0]) for pairs in [zip(input_df.columns, p) for p in poly.powers_]]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
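For readers who want the one-liner unpacked, here is a minimal sketch of the same idea as a plain function. The helper name `labels_from_powers` and the hard-coded `powers` list are assumptions made for illustration; in practice the rows would come from `poly.powers_` and the names from `input_df.columns`. Unlike the one-liner, this version labels the all-zero (bias) row as '1' instead of leaving it empty.

```python
def labels_from_powers(feature_names, powers):
    """Build one label per row of a powers matrix (hypothetical helper)."""
    labels = []
    for row in powers:
        # Keep only the features that actually appear in this term.
        terms = [
            "{}^{}".format(name, exp)
            for name, exp in zip(feature_names, row)
            if exp != 0
        ]
        # An all-zero row is the bias column; label it "1".
        labels.append(" x ".join(terms) if terms else "1")
    return labels


# Hard-coded stand-in for poly.powers_ with 3 input features and degree 2.
powers = [
    [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [2, 0, 0],
    [1, 1, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2],
]
labels = labels_from_powers(["a", "b", "c"], powers)
print(labels)
```

The resulting list can be passed as `columns=` to `pd.DataFrame` in the same way as `target_feature_names`.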
Update: as @OmerB pointed out, you can now use the get_feature_names method:
>> poly.get_feature_names(input_df.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']
scikit-learn 0.18 added a nifty get_feature_names() method!
>> input_df.columns
Index(['a', 'b', 'c'], dtype='object')
>> poly.fit_transform(input_df)
array([[ 1., 1., 2., 3., 1., 2., 3., 4., 6., 9.],
[ 1., 1., 2., 3., 1., 2., 3., 4., 6., 9.],
[ 1., 1., 2., 3., 1., 2., 3., 4., 6., 9.]])
>> poly.get_feature_names(input_df.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2']
Note that you have to supply the column names, because sklearn does not read them from the DataFrame itself.
The get_feature_names() method is good, but it returns all variables as 'x1', 'x2', 'x1 x2', etc. Below is a function that quickly converts the get_feature_names() output into a list of column names formatted as 'Col_1', 'Col_2', 'Col_1 x Col_2':
Input:
def PolynomialFeatureNames(sklearn_feature_name_output, df):
    """
    This function takes the output from the .get_feature_names() method on the PolynomialFeatures
    instance and replaces values with df column names to return output such as 'Col_1 x Col_2'

    sklearn_feature_name_output: The list object returned when calling .get_feature_names() on the PolynomialFeatures object
    df: Pandas dataframe with correct column names
    """
    import re
    cols = df.columns.tolist()
    # Map sklearn's placeholders ('x0', 'x1', ...) to the real column names.
    feat_map = {'x' + str(num): cat for num, cat in enumerate(cols)}
    feat_string = ','.join(sklearn_feature_name_output)
    for k, v in feat_map.items():
        # \b keeps 'x1' from matching inside 'x10'.
        feat_string = re.sub(fr"\b{k}\b", v, feat_string)
    # Caveat: assumes column names contain no spaces or commas.
    return feat_string.replace(" ", " x ").split(',')

# PolynomialFeatures is reached through the "pp" alias imported in the question.
interaction = pp.PolynomialFeatures(degree=2)
X_inter = interaction.fit_transform(input_df)
names = PolynomialFeatureNames(interaction.get_feature_names(), input_df)
print(pd.DataFrame(X_inter, columns=names))
Output:
1 a b c a^2 a x b a x c b^2 b x c \
0 1.00000 1.00000 2.00000 3.00000 1.00000 2.00000 3.00000 4.00000 6.00000
1 1.00000 1.00000 2.00000 3.00000 1.00000 2.00000 3.00000 4.00000 6.00000
2 1.00000 1.00000 2.00000 3.00000 1.00000 2.00000 3.00000 4.00000 6.00000
c^2
0 9.00000
1 9.00000
2 9.00000
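The `\b` word boundaries in the regular expression are what keep a placeholder such as 'x1' from also matching the start of 'x10'. Below is a small standard-library sketch of that substitution step. The helper name `rename_placeholders` and the sample names are made up for illustration, and it swaps all placeholders in a single pass so that a column which happens to be named like a placeholder is not substituted twice.

```python
import re


def rename_placeholders(names, columns):
    """Replace 'x0', 'x1', ... placeholders with column names (hypothetical helper)."""
    mapping = {"x{}".format(i): col for i, col in enumerate(columns)}
    # Match a whole placeholder token: 'x' followed by digits, bounded on both sides.
    pattern = re.compile(r"\bx\d+\b")
    # Single pass per name; placeholders without a mapping are left unchanged.
    return [
        pattern.sub(lambda m: mapping.get(m.group(0), m.group(0)), name)
        for name in names
    ]


# Eleven made-up columns so that both 'x1' and 'x10' are valid placeholders.
columns = ["col{}".format(i) for i in range(11)]
raw = ["1", "x1", "x10", "x1 x10", "x1^2"]
renamed = rename_placeholders(raw, columns)
print(renamed)
```

Because the replacement goes through a function rather than a replacement string, column names containing backslashes are inserted literally.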