PySpark dynamic mean calculation for CV target mean encoding
Using: Python 3.6, Spark 2.3
Original DF:
key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2
1    1         2         3         4         5         6
2    7         5         3         5         2         1
From the DataFrame above, I want to compute means as shown below (the same way for every column and every fold):
key  a_fold_0  b_fold_0  a_fold_1  b_fold_1  a_fold_2  b_fold_2  a_fold_0_mean  b_fold_0_mean  a_fold_1_mean
1    1         2         3         4         5         6         (3 + 5) / 2    (4 + 6) / 2    (1 + 5) / 2
2    7         5         3         5         2         1         (3 + 2) / 2    (5 + 1) / 2    (7 + 2) / 2
Process:
For fold_0 the mean should be (fold_1 + fold_2) / 2
For fold_1 the mean should be (fold_0 + fold_2) / 2
For fold_2 the mean should be (fold_0 + fold_1) / 2
This holds for every column, and the number of columns, the number of folds, and everything else should be dynamic.
How can this be solved on a PySpark DataFrame? I am trying to create new features via the cross-validated target mean encoding technique.
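To make the arithmetic concrete, here is a minimal sketch (assuming a running SparkSession named spark) that rebuilds the sample DataFrame and computes one of the mean columns by hand:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# The sample frame from the question.
df = spark.createDataFrame(
    [(1, 1, 2, 3, 4, 5, 6),
     (2, 7, 5, 3, 5, 2, 1)],
    ['key', 'a_fold_0', 'b_fold_0', 'a_fold_1', 'b_fold_1', 'a_fold_2', 'b_fold_2'])

# a_fold_0_mean averages the other folds of column a, i.e. folds 1 and 2.
df.withColumn('a_fold_0_mean', (F.col('a_fold_1') + F.col('a_fold_2')) / 2) \
  .select('key', 'a_fold_0_mean') \
  .show()
# key=1 -> (3 + 5) / 2 = 4.0; key=2 -> (3 + 2) / 2 = 2.5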
Solved it myself.
In case anyone needs to reuse the code:
# Base feature names; the frame holds one column per (feature, fold) pair.
orig_list = ['Married-spouse-absent', 'Married-AF-spouse', 'Separated',
             'Married-civ-spouse', 'Widowed', 'Divorced', 'Never-married']
k_folds = 3
cols = df.columns  # ['fnlwgt_bucketed', 'Married-spouse-absent_fold_0', 'Married-AF-spouse_fold_0', 'Separated_fold_0', 'Married-civ-spouse_fold_0', 'Widowed_fold_0', 'Divorced_fold_0', 'Never-married_fold_0', 'Married-spouse-absent_fold_1', 'Married-AF-spouse_fold_1', 'Separated_fold_1', 'Married-civ-spouse_fold_1', 'Widowed_fold_1', 'Divorced_fold_1', 'Never-married_fold_1', 'Married-spouse-absent_fold_2', 'Married-AF-spouse_fold_2', 'Separated_fold_2', 'Married-civ-spouse_fold_2', 'Widowed_fold_2', 'Divorced_fold_2', 'Never-married_fold_2']

for folds in range(k_folds):
    for column in orig_list:
        # Collect this column's sibling folds: every fold except the current one.
        col_namer = []
        for fold in range(k_folds):
            if fold != folds:
                col_namer.append(column + '_fold_' + str(fold))
        # Out-of-fold mean; built-in sum() works here because adding
        # pyspark Columns produces a Column expression.
        df = df.withColumn(column + '_fold_' + str(folds) + '_mean',
                           sum(df[col] for col in col_namer) / (k_folds - 1))
        print(col_namer)

df.show(1)
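As a quick check, the same loop applied to the two-feature toy frame from the question (the df built in the sketch above) yields the expected out-of-fold means:

orig_list = ['a', 'b']  # base column names in the toy example
k_folds = 3

for folds in range(k_folds):
    for column in orig_list:
        col_namer = [column + '_fold_' + str(fold)
                     for fold in range(k_folds) if fold != folds]
        df = df.withColumn(column + '_fold_' + str(folds) + '_mean',
                           sum(df[c] for c in col_namer) / (k_folds - 1))

df.select('key', 'a_fold_0_mean', 'b_fold_0_mean', 'a_fold_1_mean').show()
# key=1: (3 + 5) / 2 = 4.0, (4 + 6) / 2 = 5.0, (1 + 5) / 2 = 3.0
# key=2: (3 + 2) / 2 = 2.5, (5 + 1) / 2 = 3.0, (7 + 2) / 2 = 4.5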