PySpark Loses Metadata After MinMaxScaler
I am using the student dataset from:
https://archive.ics.uci.edu/ml/machine-learning-databases/00320/
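If I scale the features in the pipeline, I lose a lot of metadata that I need later. Below is the basic setup without scaling, which produces the metadata. The scaling option is commented out for easy reproduction.
I am selecting the numeric and categorical columns I want to use for the model. This is my data setup and pipeline, without scaling, so the metadata is visible.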
# load data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('student-performance').getOrCreate()
df_raw = spark.read.options(delimiter=';', header=True, inferSchema=True).csv('student-mat.csv')
# specify columns and filter
cols_cate = ['school', 'sex', 'Pstatus', 'Mjob', 'Fjob', 'famsup', 'activities', 'higher', 'internet', 'romantic']
cols_num = ['age', 'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2']
col_label = ['G3']
keep = cols_cate + cols_num + col_label
df_keep = df_raw.select(keep)
# setup pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, MinMaxScaler
cols_assembly = []
stages = []
for col in cols_cate:
    string_index = StringIndexer(inputCol=col, outputCol=col+'-indexed')
    encoder = OneHotEncoder(inputCol=string_index.getOutputCol(), outputCol=col+'-encoded')
    cols_assembly.append(encoder.getOutputCol())
    stages += [string_index, encoder]
# assemble vectors
assembler_input = cols_assembly + cols_num
assembler = VectorAssembler(inputCols=assembler_input, outputCol='features')
stages += [assembler]
# MinMaxScaler option - will need to change 'features' -> 'scaled-features' later
#scaler = MinMaxScaler(inputCol='features', outputCol='scaled-features')
#stages += [scaler]
# apply pipeline
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df_keep)
df_pipe = pipelineModel.transform(df_keep)
cols_selected = ['features'] + cols_cate + cols_num + ['G3']
df_pipe = df_pipe.select(cols_selected)
Create the training data, fit the model, and make predictions.
from pyspark.ml.regression import LinearRegression
train, test = df_pipe.randomSplit([0.7, 0.3], seed=14)
lr = LinearRegression(featuresCol='features', labelCol='G3', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(train)
lr_preds = lrModel.transform(test)
Looking at the metadata of the 'features' column, I have a lot of information here.
lr_preds.schema['features'].metadata
Output:
{'ml_attr': {'attrs': {'numeric': [{'idx': 16, 'name': 'age'},
{'idx': 17, 'name': 'Medu'},
{'idx': 18, 'name': 'Fedu'},
{'idx': 19, 'name': 'studytime'},
{'idx': 20, 'name': 'failures'},
{'idx': 21, 'name': 'famrel'},
{'idx': 22, 'name': 'goout'},
{'idx': 23, 'name': 'Dalc'},
{'idx': 24, 'name': 'Walc'},
{'idx': 25, 'name': 'health'},
{'idx': 26, 'name': 'absences'},
{'idx': 27, 'name': 'G1'},
{'idx': 28, 'name': 'G2'}],
'binary': [{'idx': 0, 'name': 'school-encoded_GP'},
{'idx': 1, 'name': 'sex-encoded_F'},
{'idx': 2, 'name': 'Pstatus-encoded_T'},
{'idx': 3, 'name': 'Mjob-encoded_other'},
{'idx': 4, 'name': 'Mjob-encoded_services'},
{'idx': 5, 'name': 'Mjob-encoded_at_home'},
{'idx': 6, 'name': 'Mjob-encoded_teacher'},
{'idx': 7, 'name': 'Fjob-encoded_other'},
{'idx': 8, 'name': 'Fjob-encoded_services'},
{'idx': 9, 'name': 'Fjob-encoded_teacher'},
{'idx': 10, 'name': 'Fjob-encoded_at_home'},
{'idx': 11, 'name': 'famsup-encoded_yes'},
{'idx': 12, 'name': 'activities-encoded_yes'},
{'idx': 13, 'name': 'higher-encoded_yes'},
{'idx': 14, 'name': 'internet-encoded_yes'},
{'idx': 15, 'name': 'romantic-encoded_no'}]},
'num_attrs': 29}}
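For reference, the nested `ml_attr` structure above can be flattened into a list of names ordered by vector index, which is usually the form needed downstream. This is only a minimal sketch: `feature_names` is a hypothetical helper (not part of PySpark), and the sample dict is an abbreviated stand-in with the same shape as the real output.

```python
def feature_names(metadata):
    """Return feature names ordered by vector index from an ml_attr metadata dict."""
    ml_attr = metadata.get('ml_attr', {})
    attrs = ml_attr.get('attrs', {})
    # 'attrs' groups entries by attribute type (e.g. 'numeric', 'binary');
    # merge every group into a single idx -> name mapping.
    by_index = {a['idx']: a.get('name') for group in attrs.values() for a in group}
    size = ml_attr.get('num_attrs', len(by_index))
    # Slots without a recorded name are returned as None.
    return [by_index.get(i) for i in range(size)]


# Abbreviated sample in the same shape as the output above (not the full metadata).
sample = {'ml_attr': {'attrs': {'numeric': [{'idx': 2, 'name': 'age'}],
                                'binary': [{'idx': 0, 'name': 'school-encoded_GP'},
                                           {'idx': 1, 'name': 'sex-encoded_F'}]},
                      'num_attrs': 3}}
names = feature_names(sample)
print(names)  # ['school-encoded_GP', 'sex-encoded_F', 'age']
```

With the real dataframe this would be called as `feature_names(lr_preds.schema['features'].metadata)`. Called on metadata that only carries `num_attrs`, it returns a list of `None` values of that length.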
If I add scaling after the VectorAssembler in the pipeline (commented out above), retrain, and predict again, all of this metadata is lost.
lr_preds.schema['scaled-features'].metadata
Output:
{'ml_attr': {'num_attrs': 29}}
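Is there any way to recover the metadata? Thanks in advance!
mck suggested reading the metadata from the 'features' column in lr_preds, which is unchanged. Thanks.
The features column should still be in the lr_preds dataframe; maybe you can get the metadata from that column instead?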
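One way to act on this suggestion is to copy the attribute names from the 'features' metadata onto the 'scaled-features' metadata, since MinMaxScaler rescales each vector slot in place and so should keep the same slot order. The sketch below works on plain dicts only; `with_copied_attrs` is a hypothetical helper, and the two sample dicts are abbreviated stand-ins for the real metadata. It assumes 'features' is still selected alongside 'scaled-features' in the dataframe.

```python
import copy


def with_copied_attrs(source_meta, target_meta):
    """Return a copy of target_meta that carries the 'attrs' of source_meta."""
    src = source_meta.get('ml_attr', {})
    dst = target_meta.get('ml_attr', {})
    # Names only line up if both vectors have the same number of slots.
    if src.get('num_attrs') != dst.get('num_attrs'):
        raise ValueError('vector sizes differ; attributes cannot be copied')
    merged = copy.deepcopy(target_meta)
    merged.setdefault('ml_attr', {})['attrs'] = copy.deepcopy(src.get('attrs', {}))
    return merged


# Abbreviated stand-ins for the 'features' and 'scaled-features' metadata.
features_meta = {'ml_attr': {'attrs': {'numeric': [{'idx': 1, 'name': 'age'}],
                                       'binary': [{'idx': 0, 'name': 'sex-encoded_F'}]},
                             'num_attrs': 2}}
scaled_meta = {'ml_attr': {'num_attrs': 2}}

restored = with_copied_attrs(features_meta, scaled_meta)
print(restored)
```

The resulting dict could then be attached back to the column, for example with `lr_preds['scaled-features'].alias('scaled-features', metadata=restored)` inside a `withColumn` call, assuming a PySpark version whose `Column.alias` accepts the `metadata` keyword. The names stay valid after scaling, but the attribute types ('numeric', 'binary') describe the unscaled values, so treat them with some caution.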