Saved Random Forest model produces different results on the same dataset

Question

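I am having trouble reproducing results when I use a Random Forest model saved to disk to predict on exactly the same dataset. In other words, I trained a model on dataset A and saved it on my local machine; I then load it and use it to predict on dataset B, and every time I predict on dataset B I get different results.

I am aware of the randomness involved in a Random Forest classifier, but as far as I understand, that randomness occurs during training. Once the model has been created, the predictions should not change if you predict on the same data.

The training script is structured as follows: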

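# Assumes an existing SparkSession named `spark` (e.g. the pyspark shell)
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

df_train = spark.read.format("csv") \
      .option('header', 'true') \
      .option('inferSchema', 'true') \
      .option('delimiter', ';') \
      .load("C:20_05.csv")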

#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = []
for variable in df_train.dtypes:
    if variable[1] == 'string':
        categorical_variables.append(variable[0])

indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]

for indexer in indexers:
    df_train = indexer.fit(df_train).transform(df_train)
    df_train = df_train.drop(indexer.getInputCol())
      
indexed_cols = []
for variable in df_train.columns:
    if variable.endswith("_indexed"):
        indexed_cols.append(variable)

encoders = []
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_train = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])

    encoder_model_train = one_hot_encoder_estimator_train.fit(df_train)
    df_train = encoder_model_train.transform(df_train)
    df_train = df_train.drop(inputCol)


inputCols = [x for x in df_train.columns if x != "id" and x != "churn"]

vector_assembler_train = VectorAssembler(
      inputCols=inputCols,
      outputCol='features',
      handleInvalid='keep'
)

df_train = vector_assembler_train.transform(df_train)

df_train = df_train.select('churn', 'features', 'id')

df_train_1 = df_train.filter(df_train['churn'] == 0).sample(withReplacement=False, fraction=0.3, seed=7)
df_train_2 = df_train.filter(df_train['churn'] == 1).sample(withReplacement=True, fraction=20.0, seed=7)
df_train = df_train_1.unionAll(df_train_2) 

rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [100]) \
    .addGrid(rf.maxDepth, [15]) \
    .addGrid(rf.maxBins, [32]) \
    .addGrid(rf.featureSubsetStrategy, ['onethird']) \
    .addGrid(rf.subsamplingRate, [1.0]) \
    .addGrid(rf.minInfoGain, [0.0]) \
    .addGrid(rf.impurity, ['gini']) \
    .addGrid(rf.minInstancesPerNode, [1]) \
    .addGrid(rf.seed, [10]) \
    .build()

evaluator = BinaryClassificationEvaluator(labelCol="churn")

crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
model = crossval.fit(df_train)
model.save("C:/myModel")

The test script is as follows:

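# Assumes an existing SparkSession named `spark` (e.g. the pyspark shell)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidatorModel

df_test = spark.read.format("csv") \
      .option('header', 'true') \
      .option('inferSchema', 'true') \
      .option('delimiter', ';') \
      .load("C:20_06.csv")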
  
#The problem seems to be related to the StringIndexer/One-Hot Encoding
#If I remove all categorical variables the results can be reproduced
categorical_variables = []
for variable in df_test.dtypes:
    if variable[1] == 'string':
        categorical_variables.append(variable[0])

indexers = [StringIndexer(inputCol=col, outputCol=col+"_indexed") for col in categorical_variables]

for indexer in indexers:
    df_test = indexer.fit(df_test).transform(df_test)
    df_test = df_test.drop(indexer.getInputCol())
      
indexed_cols = []
for variable in df_test.columns:
    if variable.endswith("_indexed"):
        indexed_cols.append(variable)

encoders = []
for variable in indexed_cols:
    inputCol = variable
    outputCol = variable.replace("_indexed", "_encoded")
    one_hot_encoder_estimator_test = OneHotEncoderEstimator(inputCols=[inputCol], outputCols=[outputCol])

    encoder_model_test = one_hot_encoder_estimator_test.fit(df_test)
    df_test = encoder_model_test.transform(df_test)
    df_test = df_test.drop(inputCol)


inputCols = [x for x in df_test.columns if x != "id" and x != "churn"]

vector_assembler_test = VectorAssembler(
      inputCols=inputCols,
      outputCol='features',
      handleInvalid='keep'
)

df_test = vector_assembler_test.transform(df_test)

df_test = df_test.select('churn', 'features', 'id')


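model = CrossValidatorModel.load("C:/myModel")

result = model.transform(df_test)

# Same evaluator settings as in the training script
evaluator = BinaryClassificationEvaluator(labelCol="churn")
areaUnderROC = evaluator.evaluate(result)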

tp = result.filter("prediction == 1.0 AND churn == 1").count()
tn = result.filter("prediction == 0.0 AND churn == 0").count()
fp = result.filter("prediction == 1.0 AND churn == 0").count()
fn = result.filter("prediction == 0.0 AND churn == 1").count()

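Every time I run the test script, the AUC and the confusion matrix are different. I am using Spark 2.4.5 and Python 3.7 on a Windows 10 machine. Any suggestions or ideas are much appreciated.

EDIT: The problem is related to the StringIndexer/One-Hot Encoding step. When I use only numeric variables, I am able to reproduce the results. The question remains open, since I cannot explain why this happens.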

Answer 1

In my experience, the problem is that you are re-evaluating (that is, re-fitting) the OneHotEncoder in the test script.

Here is how one-hot encoding works, from the docs:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast), because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].

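So every time the data is different (which is naturally the case for training versus test data), an encoder fitted on that data can produce different vectors for the same category.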
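To make the effect concrete, here is a minimal pure-Python sketch (not Spark code) of a frequency-based indexer followed by one-hot encoding. It assumes the indexer assigns index 0 to the most frequent label, which is StringIndexer's default ordering. The helper names `fit_frequency_index` and `one_hot` and the sample plan values are hypothetical, and `dropLast` is ignored for simplicity.

```python
from collections import Counter


def fit_frequency_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense one-hot vector of the given size with a 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical categorical column with different label frequencies in each dataset
train_plans = ["basic", "basic", "basic", "premium", "premium", "trial"]
test_plans = ["trial", "trial", "trial", "premium", "premium", "basic"]

# Fitting the indexer separately on each dataset, as the question's scripts do
train_index = fit_frequency_index(train_plans)
test_index = fit_frequency_index(test_plans)

for label in ["basic", "premium", "trial"]:
    print(label, one_hot(train_index[label], 3), one_hot(test_index[label], 3))
```

Fitted separately, the two mappings disagree: `basic` is index 0 in the training data but index 2 in the test data, so a model trained on the first encoding would read the second one as a different category.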

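You should add the OneHotEncoder to a Pipeline together with the model, fit the pipeline on the training data, save it, and then load it again in the test script. That way, every time data runs through the pipeline, each category is guaranteed to map to the same one-hot encoded value.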

For more details on saving and loading pipelines, see the documentation.

random-forest

apache-spark

pyspark

apache-spark-ml

one-hot-encoding