AnalysisException when loading a PipelineModel with Spark 3
I am upgrading my Spark version from 2.4.5 to 3.0.1 and I can no longer load PipelineModel objects that use a DecisionTreeClassifier stage.
In my code, I load several PipelineModels. All of those with stages ["CountVectorizer_[uid]", "LinearSVC_[uid]"] load fine, while the models with stages ["CountVectorizer_[uid]", "DecisionTreeClassifier_[uid]"] throw the following exception:
AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
Here is the code I am using, together with the full stack trace:
from pyspark.ml.pipeline import PipelineModel
PipelineModel.load("/path/to/model")
AnalysisException                          Traceback (most recent call last)
<command-1278858167154148> in <module>
----> 1 RalentModel = PipelineModel.load(MODELES_ATTRIBUTS + "RalentModel_DT")

/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
    368     def load(cls, path):
    369         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 370         return cls.read().load(path)
    371
    372

/databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
    289         metadata = DefaultParamsReader.loadMetadata(path, self.sc)
    290         if 'language' not in metadata['paramMap'] or metadata['paramMap']['language'] != 'Python':
--> 291             return JavaMLReader(self.cls).load(path)
    292         else:
    293             uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)

/databricks/spark/python/pyspark/ml/util.py in load(self, path)
    318         if not isinstance(path, basestring):
    319             raise TypeError("path should be a basestring, got type %s" % type(path))
--> 320         java_obj = self._jread.load(path)
    321         if not hasattr(self._clazz, "_from_java"):
    322             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    131                 # Hide where the exception came from that shows a non-Pythonic
    132                 # JVM exception message.
--> 133                 raise_from(converted)
    134             else:
    135                 raise

/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
These pipeline models were saved with Spark 2.4.3, and I can load them with Spark 2.4.5.
I tried to investigate further by loading each stage separately. Loading the CountVectorizerModel with
from pyspark.ml.feature import CountVectorizerModel
CountVectorizerModel.read().load("/path/to/model/stages/0_CountVectorizer_efce893314a9")
yields a CountVectorizerModel, so that part works, but my code fails when trying to load the DecisionTreeClassificationModel:
from pyspark.ml.classification import DecisionTreeClassificationModel
DecisionTreeClassificationModel.read().load("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0")
AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
Here is the content of my DecisionTreeClassifier's "data":
spark.read.parquet("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data").show()
+---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
| id|prediction| impurity|impurityStats| gain|leftChild|rightChild| split|
+---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
| 0| 0.0| 0.3926234384295062| [90.0, 33.0]| 0.16011830963990054| 1| 16|[190, [0.5], -1]|
| 1| 0.0| 0.2672722508516028| [90.0, 17.0]| 0.11434106988303855| 2| 15|[512, [0.5], -1]|
| 2| 0.0| 0.1652892561983472| [90.0, 9.0]| 0.06959547629404085| 3| 14|[583, [0.5], -1]|
| 3| 0.0| 0.09972299168975082| [90.0, 5.0]|0.026984966852376356| 4| 11|[480, [0.5], -1]|
| 4| 0.0|0.043933846736523306| [87.0, 2.0]|0.021717299239076976| 5| 10|[555, [1.5], -1]|
| 5| 0.0|0.022469008264462766| [87.0, 1.0]|0.011105371900826402| 6| 7|[833, [0.5], -1]|
| 6| 0.0| 0.0| [86.0, 0.0]| -1.0| -1| -1| [-1, [], -1]|
| 7| 0.0| 0.5| [1.0, 1.0]| 0.5| 8| 9| [0, [0.5], -1]|
| 8| 0.0| 0.0| [1.0, 0.0]| -1.0| -1| -1| [-1, [], -1]|
| 9| 1.0| 0.0| [0.0, 1.0]| -1.0| -1| -1| [-1, [], -1]|
| 10| 1.0| 0.0| [0.0, 1.0]| -1.0| -1| -1| [-1, [], -1]|
| 11| 0.0| 0.5| [3.0, 3.0]| 0.5| 12| 13| [14, [1.5], -1]|
| 12| 0.0| 0.0| [3.0, 0.0]| -1.0| -1| -1| [-1, [], -1]|
| 13| 1.0| 0.0| [0.0, 3.0]| -1.0| -1| -1| [-1, [], -1]|
| 14| 1.0| 0.0| [0.0, 4.0]| -1.0| -1| -1| [-1, [], -1]|
| 15| 1.0| 0.0| [0.0, 8.0]| -1.0| -1| -1| [-1, [], -1]|
| 16| 1.0| 0.0| [0.0, 16.0]| -1.0| -1| -1| [-1, [], -1]|
+---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
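Note that the show() output above has no rawCount column, which is precisely the column the Spark 3 loader fails to resolve. A quick way to confirm the mismatch is to print the schema that Spark 2.4 actually wrote (a small diagnostic sketch; the path is the same placeholder as above):
# Print the persisted node schema; there is no `rawCount` field,
# which is the column named in the AnalysisException.
spark.read.parquet(
    "/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data"
).printSchema()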
This is a bug; I filed an issue for it here: https://issues.apache.org/jira/browse/SPARK-33398, and it was resolved in this PR: https://github.com/apache/spark/pull/30889
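If you cannot move to a build that contains the fix yet, a rough workaround (my own sketch, not an official recipe) is to patch the persisted node data by adding the missing rawCount column and swapping the patched copy in for the stage's data directory. The assumption here is that rawCount is the total number of training rows reaching a node, i.e. the sum of the per-class counts stored in impurityStats:
from pyspark.sql import functions as F

data_path = "/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data"
nodes = spark.read.parquet(data_path)

# Assumption: rawCount = total sample count at the node, i.e. the sum of
# the per-class counts in impurityStats. aggregate() is a Spark SQL
# higher-order function available since 2.4.
patched = nodes.withColumn(
    "rawCount",
    F.expr("cast(aggregate(impurityStats, cast(0.0 as double), (acc, x) -> acc + x) as bigint)"),
)

# Write next to the original; replace the old `data` directory with this
# patched copy before calling PipelineModel.load again.
patched.write.mode("overwrite").parquet(data_path + "_patched")
Upgrading to a release that includes the fix remains the cleaner option.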