H2O H2OServerError: HTTP 500 Server Error when training model
I am trying to train a DRF classifier in h2o (version 3.20.0.5) and get the error "H2OServerError: HTTP 500 Server Error" with no further explanation.
---------------------------------------------------------------------------
H2OServerError Traceback (most recent call last)
<ipython-input-44-f52d1cb4b77a> in <module>()
4 training_frame=train_u, validation_frame=val_u,
5 weights_column='weight',
----> 6 max_runtime_secs=max_train_time_hrs*60*60)
7
8
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/estimators/estimator_base.pyc in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
224 rest_ver = parms.pop("_rest_version") if "_rest_version" in parms else 3
225
--> 226 model_builder_json = h2o.api("POST /%d/ModelBuilders/%s" % (rest_ver, self.algo), data=parms)
227 model = H2OJob(model_builder_json, job_type=(self.algo + " Model Build"))
228
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
101 # type checks are performed in H2OConnection class
102 _check_connection()
--> 103 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
104
105
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
400 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
401 self._log_end_transaction(start_time, resp)
--> 402 return self._process_response(resp, save_to)
403
404 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:
/home/mapr/python-virtual-envs/ml1c/venv/lib/python2.7/site-packages/h2o/backend/connection.pyc in _process_response(response, save_to)
728 # Note that it is possible to receive valid H2OErrorV3 object in this case, however it merely means the server
729 # did not provide the correct status code.
--> 730 raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data))
731
732
H2OServerError: HTTP 500 Server Error:
Server error java.lang.NullPointerException:
Error: Caught exception: java.lang.NullPointerException
Request: None
The offending code snippet looks like this:
max_train_time_hrs = 8
drf_proc.train(
x=train_features, y=train_response,
training_frame=train_u, validation_frame=val_u,
weights_column='weight',
max_runtime_secs=max_train_time_hrs*60*60)
The output of running h2o.init() looks like this:
Checking whether there is an H2O instance running at http://172.18.4.62:54321. connected.
Warning: Your H2O cluster version is too old (7 months and 24 days)! Please download and install the latest version from http://h2o.ai/download/
H2O cluster uptime: 06 secs
H2O cluster timezone: Pacific/Honolulu
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.5
H2O cluster version age: 7 months and 24 days !!!
H2O cluster name: H2O_88021
H2O cluster total nodes: 4
H2O cluster free memory: 15.34 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://172.18.4.62:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: AutoML, XGBoost, Algos, Core V3, Core V4
Python version: 2.7.12 fin
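I realize the warning says the h2o version I am using is "too old", but the h2o Python package version and the cluster I am connecting to still match. Other h2o applications access this cluster and expect a specific version (all of them appear to run on the cluster without problems), so upgrading is not an option. Also, I cannot reach the H2O connection URL from any web browser.
Any ideas about what could be happening here, or debugging steps I could look into?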
15GB of memory may not be enough for a training run that you expect to last 8 hours. (As an aside: I'd recommend using early stopping rather than max_runtime_secs.)
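To illustrate the early stopping aside, here is a minimal sketch of DRF settings that stop training once the validation metric stops improving. The parameter names (stopping_rounds, stopping_metric, stopping_tolerance, score_tree_interval) are H2O's DRF options; the values, and the helper name build_drf, are illustrative assumptions rather than tuned recommendations.

```python
# Early-stopping settings for H2O's DRF. All values are illustrative assumptions.
early_stopping_params = {
    "ntrees": 500,               # upper bound; early stopping may end training sooner
    "stopping_rounds": 3,        # stop after 3 scoring events without improvement
    "stopping_metric": "logloss",
    "stopping_tolerance": 1e-3,  # minimum relative improvement that counts
    "score_tree_interval": 10,   # score every 10 trees so stopping can trigger
}


def build_drf(params=None):
    """Hypothetical helper: build a DRF estimator with the settings above."""
    # Imported lazily so the settings can be inspected without h2o installed.
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    return H2ORandomForestEstimator(**dict(params or early_stopping_params))
```

With a running cluster, drf_proc = build_drf() followed by the same drf_proc.train(...) call as in the question, minus max_runtime_secs, would use the validation frame to decide when to stop.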
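As a debugging step, I'd suggest watching the Flow interface (point a browser at port 54321; see the connection URL in the h2o.init() output). In particular, watch how memory usage rises over time.
(Sometimes a "500" error just means the cluster has become unstable, and running out of memory is a common trigger.)
If you get the error immediately, memory is unlikely to be the problem (unless you have a huge dataset).
In that case, I'd try to narrow down whether particular columns or rows of data are causing the problem. For example: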
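- Experiment 1: the first half of the columns in train_features
- Experiment 2: the second half of the columns in train_features
- Experiment 3: the first half of the rows in train_u
- Experiment 4: the second half of the rows in train_u
- Experiments 5/6 (if still no luck): the same with val_u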
If one experiment in a pair crashes and the other does not, repeat the experiment on the half that crashed.
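The halving experiments above can be sketched as a small loop. This is only a sketch: narrow_down and make_column_check are hypothetical helper names, the frame and column names are assumed to be those from the question, and ntrees=5 is an arbitrary small value to keep each trial short.

```python
def narrow_down(items, crashes):
    """Shrink `items` by halving while `crashes(subset)` stays True for one half."""
    current = list(items)
    while len(current) > 1:
        mid = len(current) // 2
        first, second = current[:mid], current[mid:]
        if crashes(first):
            current = first
        elif crashes(second):
            current = second
        else:
            # Neither half fails alone, so the problem needs items from both.
            break
    return current


def make_column_check(train_frame, response, weights_column):
    """Hypothetical helper: returns a function that reports whether DRF training fails."""
    # Imported lazily so narrow_down can be used without h2o installed.
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.exceptions import H2OServerError

    def crashes(columns):
        model = H2ORandomForestEstimator(ntrees=5)  # small forest, assumed sufficient
        try:
            model.train(x=columns, y=response, training_frame=train_frame,
                        weights_column=weights_column)
        except H2OServerError:
            return True
        return False

    return crashes
```

For the columns case, this could be called as narrow_down(train_features, make_column_check(train_u, train_response, 'weight')). The same narrow_down loop applies to rows if the check trains on a slice of train_u instead of a subset of columns.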