Spark random forest classifier - get labels as String
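I'm new to Spark, and I want to use it for a random forest classifier.
I use the Iris data in libsvm format to build the model.
My question is: how can I get the labels as strings? (In this case, the labels are the iris species.)
When the data is converted to libsvm format, each label gets an integer that represents it, but I don't know how to get back to the string labels.
Is that possible with libsvm, or should I use a different format?
Here is my code: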
import java.util.List;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public PipelineModel runRandomForestAlgorithm(String dataPath) {
    System.setProperty("hadoop.home.dir", "C:/hadoop");
    SparkSession spark = SparkSession.builder()
            .appName("JavaRandomForestClassifierExample")
            .master("local[*]")
            .getOrCreate();

    /* Load and parse the data file, converting it to a DataFrame. */
    DataFrameReader dataFrameReader = spark.read().format("libsvm");
    Dataset<Row> data = dataFrameReader.load(dataPath);

    /* Index labels, adding metadata to the label column.
       Fit on whole dataset to include all labels in index. */
    StringIndexerModel labelIndexer = new StringIndexer()
            .setInputCol("label")
            .setOutputCol("indexedLabel")
            .fit(data);

    /* Automatically identify categorical features, and index them.
       Set maxCategories so features with > 4 distinct values are treated as continuous. */
    VectorIndexerModel featureIndexer = new VectorIndexer()
            .setInputCol("features")
            .setOutputCol("indexedFeatures")
            .setMaxCategories(4)
            .fit(data);

    /* Split the data into training and test sets (10% held out for testing). */
    Dataset<Row>[] splits = data.randomSplit(new double[]{0.9, 0.1});
    Dataset<Row> trainingData = splits[0];
    Dataset<Row> testData = splits[1];

    /* Train a RandomForest model. */
    RandomForestClassifier rf = new RandomForestClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("indexedFeatures")
            .setNumTrees(10);

    /* Convert indexed labels back to original labels. */
    IndexToString labelConverter = new IndexToString()
            .setInputCol("prediction")
            .setOutputCol("predictedLabel")
            .setLabels(labelIndexer.labels());

    /* Chain indexers and forest in a Pipeline. */
    Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{labelIndexer, featureIndexer, rf, labelConverter});

    /* Train model. This also runs the indexers. */
    PipelineModel model = pipeline.fit(trainingData);

    /* Make predictions. */
    Dataset<Row> predictions = model.transform(testData);

    /* Select example rows to display. */
    List<Row> predictionAsRows = predictions
            .select("predictedLabel", "label", "features", "rawPrediction", "probability")
            .collectAsList();
    predictionAsRows.forEach(row -> System.out.println(
            "predictedLabel: " + row.get(0) + " , " + "label: " + row.get(1) + " , "
            + "features: " + row.get(2) + " , " + "predictions: " + row.get(3) + " , "
            + "probabilities: " + row.get(4)));

    return model;
}
Here is the output:
predictedLabel: 1.0 , label: 1.0 , features: (4,[0,1,2,3],[-0.833333,0.333333,-1.0,-0.916667]) , predictions: [10.0,0.0,0.0] , probabilities: [1.0,0.0,0.0]
predictedLabel: 1.0 , label: 1.0 , features: (4,[0,1,2,3],[-0.555556,0.166667,-0.830508,-0.916667]) , predictions: [10.0,0.0,0.0] , probabilities: [1.0,0.0,0.0]
predictedLabel: 2.0 , label: 2.0 , features: (4,[0,1,2,3],[-0.333333,-0.75,0.0169491,-4.03573E-8]) , predictions: [0.0,0.0,10.0] , probabilities: [0.0,0.0,1.0]
predictedLabel: 2.0 , label: 2.0 , features: (4,[0,1,2,3],[-0.166667,-0.416667,-0.0169491,-0.0833333]) , predictions: [0.0,0.0,10.0] , probabilities: [0.0,0.0,1.0]
predictedLabel: 2.0 , label: 2.0 , features: (4,[0,1,2,3],[0.166667,-0.25,0.118644,-4.03573E-8]) , predictions: [0.0,0.0,10.0] , probabilities: [0.0,0.0,1.0]
predictedLabel: 2.0 , label: 2.0 , features: (4,[0,1,2,3],[0.277778,-0.166667,0.152542,0.0833333]) , predictions: [0.0,0.0,10.0] , probabilities: [0.0,0.0,1.0]
predictedLabel: 2.0 , label: 2.0 , features: (4,[0,2,3],[0.5,0.254237,0.0833333]) , predictions: [0.0,0.0,10.0] , probabilities: [0.0,0.0,1.0]
predictedLabel: 3.0 , label: 3.0 , features: (4,[0,1,2,3],[-0.166667,-0.416667,0.38983,0.5]) , predictions: [0.0,9.875,0.125] , probabilities: [0.0,0.9875,0.0125]
predictedLabel: 3.0 , label: 3.0 , features: (4,[0,1,2,3],[0.555555,-0.166667,0.661017,0.666667]) , predictions: [0.0,10.0,0.0] , probabilities: [0.0,1.0,0.0]
predictedLabel: 3.0 , label: 3.0 , features: (4,[0,1,2,3],[0.833333,-0.166667,0.898305,0.666667]) , predictions: [0.0,10.0,0.0] , probabilities: [0.0,1.0,0.0]
predictedLabel: 3.0 , label: 3.0 , features: (4,[0,2,3],[0.222222,0.38983,0.583333]) , predictions: [0.0,10.0,0.0] , probabilities: [0.0,1.0,0.0]
predictedLabel: 3.0 , label: 3.0 , features: (4,[0,2,3],[0.388889,0.661017,0.833333]) , predictions: [0.0,10.0,0.0] , probabilities: [0.0,1.0,0.0]
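With the libsvm format you only get a number for each class, so the string class labels cannot be recovered from the file itself.
You can use the IndexToString transformer with its setLabels() method: just pass in the array of labels you have. IndexToString looks labels up by position, so the array must be ordered to match the numeric class indices. To do this, you should probably remove the StringIndexerModel (it is unnecessary anyway, since the classes are already numbers, not strings). Example: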
// labels[i] is returned for prediction i, so the order must match the class indices.
String[] labels = {"Setosa", "Versicolor", "Virginica"};
IndexToString labelConverter = new IndexToString()
        .setInputCol("prediction")
        .setOutputCol("predictedLabel")
        .setLabels(labels);
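To make the positional lookup concrete, here is a minimal plain-Java sketch that needs no Spark. It only imitates the "prediction i maps to labels[i]" rule and is not Spark's implementation; the class `LabelLookupSketch` and the helper `toLabel` are hypothetical names. The placeholder at index 0 is an assumption based on the class values 1.0 to 3.0 shown in the question's output. With zero-based classes it would not be needed, and whether Spark trains on such labels without a StringIndexer may depend on the version.

```java
import java.util.Arrays;
import java.util.List;

class LabelLookupSketch {
    // Hypothetical helper: positional lookup, labels[i] for class index i.
    static String toLabel(double prediction, String[] labels) {
        int idx = (int) prediction;
        if (idx < 0 || idx >= labels.length) {
            throw new IllegalArgumentException("No label for class index " + idx);
        }
        return labels[idx];
    }

    public static void main(String[] args) {
        // Assumption: classes in the data are 1, 2, 3, so index 0 is a placeholder.
        String[] labels = {"(unused)", "Setosa", "Versicolor", "Virginica"};
        List<Double> predictions = Arrays.asList(1.0, 2.0, 3.0);
        for (double p : predictions) {
            System.out.println(p + " -> " + toLabel(p, labels));
        }
    }
}
```

With a three-element array and classes numbered from 1, the same lookup would shift every name by one and fail on class 3, which is why the ordering matters.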
Alternatively, you can create a separate Map that maps the integers to the string labels. For the Iris dataset it could look like this:
Map<Integer, String> labels = new HashMap<>();
labels.put(1, "Setosa");
labels.put(2, "Versicolor");
labels.put(3, "Virginica");
You can then use this Map to look up the string labels once all Spark transformations are done.
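The Map approach can be sketched in plain Java as follows. This is only an illustration under assumptions: the class `IrisLabelMap`, the helper `labelFor`, and the "unknown" fallback are hypothetical, and the integer-to-species assignment simply follows the mapping given above rather than anything stored in the libsvm file.

```java
import java.util.HashMap;
import java.util.Map;

class IrisLabelMap {
    static final Map<Integer, String> LABELS = new HashMap<>();
    static {
        LABELS.put(1, "Setosa");
        LABELS.put(2, "Versicolor");
        LABELS.put(3, "Virginica");
    }

    // Predictions arrive as doubles (1.0, 2.0, ...), so round to the integer key first.
    static String labelFor(double prediction) {
        return LABELS.getOrDefault((int) Math.round(prediction), "unknown");
    }

    public static void main(String[] args) {
        double[] predictions = {1.0, 3.0, 7.0};
        for (double p : predictions) {
            System.out.println(p + " -> " + labelFor(p));
        }
    }
}
```

In a Spark job the lookup would typically be applied to the collected rows, for example on the value read from the prediction column.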
Hope this helps.