H2OGridSearch H2OGBM pyspark: NullPointerException in extractH2OParameters

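I am trying to run a grid search for a gradient boosting machine in pyspark using H2O Sparkling Water.

Here is a reproducible example built on the well-known iris dataset.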

from pysparkling import H2OContext, H2OConf
import pyspark
from pyspark.sql.types import StructType, StructField, FloatType, StringType
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("local").setAppName("test")
conf.set("spark.sql.shuffle.partitions", 3)
conf.set("spark.default.parallelism", 3)
conf.set("spark.debug.maxToStringFields", 100)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
hc = H2OContext.getOrCreate(sc, H2OConf(sc).set_internal_cluster_mode())
schema = StructType([
    StructField("sepal_length", FloatType(), True),
    StructField("sepal_width", FloatType(), True),
    StructField("petal_length", FloatType(), True),
    StructField("petal_width", FloatType(), True),
    StructField("class", StringType(), True)])
iris_df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'false') \
        .option('delimiter', ',') \
        .schema(schema) \
        .load('../../../../Downloads/iris.data')

If I follow this page of H2O docs and simply translate it to Python,

gbm_params = {'learnRate': [0.01, 0.1],
              'ntrees': [100 , 200, 300, 500]}
gbm_grid = H2OGridSearch()\
    .setLabelCol("class") \
    .setHyperParameters(gbm_params)\
    .setAlgo(H2OGBM().setMaxDepth(30))

model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)

I get an internal NullPointerException, so I suppose something is missing from the configuration.

Py4JJavaError: An error occurred while calling o111.fit.
: java.lang.NullPointerException
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.extractH2OParameters(H2OGridSearch.scala:352)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:64)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

If I rewrite it in a different way, I get a different error,

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'learnRate': [0.01, 0.1]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)

This is the output, no matter how I change the hyperparameters,

Py4JJavaError: An error occurred while calling o1817.fit.
: java.lang.NoSuchFieldException: learnRate
    at java.lang.Class.getField(Unknown Source)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.findField(H2OGridSearch.scala:170)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.processHyperParams(H2OGridSearch.scala:154)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:71)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

The following works, but it is useless because there is no grid,

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         #hyperParameters=gbm_params,
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()

Finally, just to make sure that learnRate is a parameter of H2OGBM, this also works,

gbm_model = H2OGBM(labelCol='class',
                   withDetailedPredictionCol=True).setLearnRate(0.01).setMaxDepth(5).setNtrees(100)

model_pipeline = Pipeline().setStages([gbm_model])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()

Edit: missing imports

from pyspark.ml.pipeline import Pipeline
from ai.h2o.sparkling.ml.algos import H2OGridSearch
from ai.h2o.sparkling.ml.algos import H2OGBM

and the Sparkling Water version

h2o-pysparkling-2-4       3.28.0.1-1               pypi_0    pypi

Edit after a comment asking for the Spark/H2O/Java versions

Spark: 2.4.4

H2O: 3.28.0.3

Java: 1.8.0_232


Edit: output of java -version

openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

If I use learn_rate instead of learnRate, I get the same kind of error.

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'learn_rate': [0.01, 0.1]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)

...

Py4JJavaError: An error occurred while calling o1376.fit.
: java.lang.NoSuchFieldException: learn_rate
    at java.lang.Class.getField(Class.java:1703)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.findField(H2OGridSearch.scala:170)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.processHyperParams(H2OGridSearch.scala:154)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:71)
    at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

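Why not use a workaround and create the grid in the H2O UI? There is a checkbox that turns the selected parameter into a grid parameter, and you can then supply its values as a comma-separated list in the web form field where you would normally enter a single value.

There is a workaround here that I had not noticed (perhaps I should have posted this as a bug on GitHub first). The hyperparameter keys have to be written with a leading underscore and in snake case, e.g. _learn_rate and _ntrees.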

gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'_learn_rate':[0.01, 0.1], '_ntrees': [100, 200]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()
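
The `Class.getField` frames in the stack traces above suggest that this version looks hyperparameter names up by reflection, which would explain why only the underscore-prefixed names are accepted. Below is a minimal sketch of a helper that rewrites setter-style keys into that form. The function names `to_h2o_field_name` and `to_h2o_hyper_params` are hypothetical and not part of Sparkling Water, and the `_snake_case` convention is an assumption: only `_learn_rate` and `_ntrees` are confirmed by the workaround.

```python
import re


def to_h2o_field_name(name):
    """Turn 'learnRate' or 'learn_rate' into '_learn_rate'.

    Hypothetical helper: assumes the underlying parameter fields are
    snake_case with a leading underscore, as in the workaround above.
    """
    if name.startswith("_"):
        return name  # already in the expected form
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole name.
    snake = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()
    return "_" + snake


def to_h2o_hyper_params(params):
    """Rewrite every key of a hyperparameter dict; values are untouched."""
    return {to_h2o_field_name(key): values for key, values in params.items()}


gbm_params = {'learnRate': [0.01, 0.1], 'ntrees': [100, 200]}
print(to_h2o_hyper_params(gbm_params))
# {'_learn_rate': [0.01, 0.1], '_ntrees': [100, 200]}
```

The resulting dict could then be passed as hyperParameters to H2OGridSearch in the snippet above. Whether other parameters follow the same naming has not been checked here.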