How to specify the "positive class" in SparkML (binary) classification?
(Or perhaps: how does the MulticlassClassificationEvaluator determine which class is the "positive" one?)
Say we are training a model to target precision in a binary classification problem, e.g. ...
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import (IndexToString, OneHotEncoder,
                                StringIndexer, VectorAssembler)
from pyspark.ml.tuning import CrossValidator

label_idxer = StringIndexer(inputCol="response",
                            outputCol="label").fit(df_spark)
# we fit so we can get the "labels" attribute to inform the reconversion stage
feature_idxer = StringIndexer(inputCols=cat_features,
                              outputCols=[f"{f}_IDX" for f in cat_features],
                              handleInvalid="keep")
onehotencoder = OneHotEncoder(inputCols=feature_idxer.getOutputCols(),
                              outputCols=[f"{f}_OHE" for f in feature_idxer.getOutputCols()])
assembler = VectorAssembler(inputCols=(num_features + onehotencoder.getOutputCols()),
                            outputCol="features")
rf = RandomForestClassifier(labelCol=label_idxer.getOutputCol(),
                            featuresCol=assembler.getOutputCol(),
                            seed=123456789)
label_converter = IndexToString(inputCol=rf.getPredictionCol(),
                                outputCol="prediction_label",
                                labels=label_idxer.labels)
pipeline = Pipeline(stages=[label_idxer, feature_idxer, onehotencoder,
                            assembler,
                            rf,
                            label_converter])  # type: pyspark.ml.Pipeline
crossval = CrossValidator(estimator=pipeline,
                          evaluator=MulticlassClassificationEvaluator(
                              labelCol=rf.getLabelCol(),
                              predictionCol=rf.getPredictionCol(),
                              metricName="weightedPrecision"),
                          numFolds=3)
(train_u, test_u) = df_spark.randomSplit([0.8, 0.2])
model = crossval.fit(train_u)
I know that...
Precision = TP / (TP + FP)
...but how do you designate a particular class label as the "positive class" so that precision targets it? (As it stands, I don't know which response value is actually being treated as positive in training, nor how to tell.)
From a discussion on the Spark mailing list...
The positive class is "1" and negative is "0" by convention; I don't think you can change that (though you can translate your data if needed).
F1 is defined only in a one-vs-rest sense in multi-class evaluation. You can set 'metricLabel' to define which class is 'positive' in multiclass - everything else is 'negative'.
Note that this implies that (absent setting metricLabel on the MulticlassClassificationEvaluator) the StringIndexer, specifically its stringOrderType parameter (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html?highlight=stringindexer#pyspark.ml.feature.StringIndexer.stringOrderType), is where users control which class they are calling positive vs. negative. (Note that per the docs the default is frequencyDesc, so the most frequent string gets index 0; if frequencies are tied under frequencyDesc/Asc, strings are further sorted alphabetically. That is, in the common case of a minority positive class you'd be fine with the default, since the rarer label lands on index 1; otherwise the label values need to be chosen so the resulting ordering follows the 0=negative, 1=positive convention.)
In multi-class, there is no 'positive' class, they're all just classes. It defaults to 0 there but 0 doesn't have any particular meaning.
You could apply this to a binary class setup. In that case, you could simply ask for F1 for label 0, and that would compute F1 for '0-vs-rest', and that would be like treating 0 as the 'positive' class for purposes of F1.
One wrinkle with this interpretation is that BinaryClassificationEvaluator does not appear to support evaluating Fbeta, recall, precision, etc. (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html?highlight=binaryclassificationevaluator#pyspark.ml.evaluation.BinaryClassificationEvaluator.metricName) whereas the MulticlassClassificationEvaluator does (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html?highlight=classificationevaluator#pyspark.ml.evaluation.MulticlassClassificationEvaluator.metricName). This means users would need to switch between the two evaluators if, say, they want to try training a model targeting areaUnderROC and then F1. In the binary classification case it also means flipping the index of the positive class from 1 (since, as you say, 1 is the conventional positive class in binary classification) to 0 (since the docs say the multiclass evaluator's default metricLabel is 0), unless they set metricLabel explicitly.