Dimension mismatch error in Spark ML
I'm new to both ML and Spark ML. I'm trying to build a prediction model with a neural network in Spark ML, but I get the error below when I call the .transform method of my learned model. The problem is caused by the OneHotEncoder, because everything works fine without it.
What I've tried: taking the OneHotEncoder out of the pipeline.
My question is: how can I use the OneHotEncoder without getting this error?
java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41)
    at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:163)
    at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:482)
    at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:529)
My code:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

test_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.test', names=header, skipinitialspace=True)
train_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.data', names=header, skipinitialspace=True)

train_df = sqlContext.createDataFrame(train_pandas_df)
test_df = sqlContext.createDataFrame(test_pandas_df)
joined = train_df.union(test_df)

assembler = VectorAssembler().setInputCols(features).setOutputCol("features")

label_indexer = StringIndexer().setInputCol("label").setOutputCol("label_index")
label_indexer_fit = [label_indexer.fit(joined)]

string_indexers = [
    StringIndexer().setInputCol(name).setOutputCol(name + "_index").fit(joined)
    for name in categorical_feats]

one_hot_pipeline = Pipeline().setStages([
    OneHotEncoder().setInputCol(name + '_index').setOutputCol(name + '_one_hot')
    for name in categorical_feats])

mlp = (MultilayerPerceptronClassifier()
       .setLabelCol(label_indexer.getOutputCol())
       .setFeaturesCol(assembler.getOutputCol())
       .setLayers([len(features), 20, 10, 2])
       .setSeed(42L)
       .setBlockSize(1000)
       .setMaxIter(500))

pipeline = Pipeline().setStages(label_indexer_fit + string_indexers
                                + [one_hot_pipeline] + [assembler, mlp])
model = pipeline.fit(train_df)

# compute accuracy on the test set
result = model.transform(test_df)
## FAILS ON RESULT
predictionAndLabels = result.select("prediction", "label_index")
evaluator = MulticlassClassificationEvaluator(labelCol="label_index")
print "-------------------------------"
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
print "-------------------------------"
Thanks!
Your model is incorrect: the layers Param

setLayers([len(features), 20, 10, 2])

uses the wrong input size. The first layer has to reflect the number of input features, which in general won't be the same as the number of raw columns before encoding.
If you don't know the total number of features upfront, you can separate feature extraction from model training. Pseudocode:
feature_pipeline_model = (Pipeline()
    .setStages(...)  # Only feature extraction
    .fit(train_df))

train_df_features = feature_pipeline_model.transform(train_df)
layers = [
    train_df_features.schema["features"].metadata["ml_attr"]["num_attrs"],
    20, 10, 2
]
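To see why the assembled vector is wider than len(features): Spark's OneHotEncoder, with its default dropLast=True, turns an indexed column with k categories into a vector of length k-1, so the assembled feature vector gets one slot per numeric column plus k-1 slots per categorical column. A minimal back-of-the-envelope sketch, using hypothetical column names and cardinalities (not taken from the Adult dataset):

```python
# Sketch of the input-layer arithmetic after Spark-style one-hot encoding.
# Column names and cardinalities below are hypothetical examples.
numeric_feats = ["age", "hours_per_week"]             # passed through as-is
categorical_cardinality = {"workclass": 9, "sex": 2}  # k categories each


def input_layer_size(n_numeric, cardinalities, drop_last=True):
    """One numeric column contributes 1 slot; a categorical column with k
    categories contributes k-1 slots, because Spark's OneHotEncoder drops
    the last category by default (drop_last=False would keep all k)."""
    width = (lambda k: k - 1) if drop_last else (lambda k: k)
    return n_numeric + sum(width(k) for k in cardinalities.values())


size = input_layer_size(len(numeric_feats), categorical_cardinality)
print(size)  # 2 numeric + (9-1) + (2-1) = 11, not len(features) == 4
```

So a layers spec built from the raw column count is too small, which is exactly what the dgemm dimension check rejects.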
I ran into the same problem and took a more manual approach than user6910411's suggestion. For example, I had

layers = [**100**, 100, 100, 100]

but the actual number of my input variables was 199, so I changed it to

layers = [**199**, 100, 100, 100]

and the problem seems to be solved. :-D