H2O Target Mean Encoder "frames are being sent in the same order" ERROR
I am following the H2O example to run target mean encoding in Sparkling Water (Sparkling Water 2.4.2 and H2O 3.22.04). All of the following paragraphs run fine:
from h2o.targetencoder import TargetEncoder
# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()
# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)
# find all categorical features
cat_features = [k for (k, v) in input_df_h2o.types.items() if v == 'string']
# convert string to factor
for i in cat_features:
    input_df_h2o[i] = input_df_h2o[i].asfactor()
# target mean encode
targetEncoder = TargetEncoder(x= cat_features, y = y, fold_column = "cv_fold_te", blending_avg=True)
targetEncoder.fit(input_df_h2o)
But when I run the transform code on the same dataset that was used to fit the Target Encoder (see the code below):
ext_input_df_h2o = targetEncoder.transform(frame=input_df_h2o,
                                           holdout_type="kfold",  # mean is calculated on out-of-fold data only; "loo" means leave one out
                                           is_train_or_valid=True,
                                           noise=0,  # determines if random noise should be added to the target average
                                           seed=54321)
I get an error like this:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6773422589366407956.py", line 331, in <module>
    exec(code)
  File "<stdin>", line 5, in <module>
  File "/usr/lib/envs/env-1101-ver-1619-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/h2o/targetencoder.py", line 97, in transform
    assert self._encodingMap.map_keys['string'] == self._teColumns
AssertionError
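I found the failing line in the source code: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/targetencoder.html
But how can I fix this? The table passed to transform is the same one that was used to run fit.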
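The problem occurs because you are trying to encode multiple categorical features at once. I think this is a bug in H2O, but you can work around it by putting the encoder inside a for loop that iterates over all the categorical column names, so that each encoder handles a single column.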
import numpy as np
import pandas as pd
import h2o
from h2o.targetencoder import TargetEncoder
h2o.init()
df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'x_1': ['c'] * 9 + ['d'] * 1,
    'x_2': ['a'] * 3 + ['b'] * 7,
    'y_0': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
hf = h2o.H2OFrame(df)
hf['cv_fold_te'] = hf.kfold_column(n_folds=2, seed=54321)
hf['y_0'] = hf['y_0'].asfactor()
cat_features = ['x_0', 'x_1', 'x_2']
for item in cat_features:
    # one encoder per column, so the encoder only ever sees a single key
    target_encoder = TargetEncoder(x=[item], y='y_0', fold_column='cv_fold_te')
    target_encoder.fit(hf)
    hf = target_encoder.transform(frame=hf, holdout_type='kfold',
                                  seed=54321, noise=0.0)
hf
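For anyone unsure what holdout_type='kfold' is computing for each column, here is a rough plain-Python sketch of an out-of-fold target mean. It is only an illustration and not H2O's implementation: there is no blending, noise, or fallback to a prior, the helper name kfold_target_mean is made up, and the fold assignment below is hypothetical (kfold_column would produce its own).

```python
from collections import defaultdict

def kfold_target_mean(categories, targets, folds):
    """For each row, return the mean target over rows that share its category
    but sit in a different fold (None when there are no such rows)."""
    sums = defaultdict(float)   # (category, fold) -> sum of targets
    counts = defaultdict(int)   # (category, fold) -> number of rows
    for c, t, f in zip(categories, targets, folds):
        sums[(c, f)] += t
        counts[(c, f)] += 1

    fold_ids = set(folds)
    encoded = []
    for c, f in zip(categories, folds):
        # aggregate only over the folds this row does NOT belong to
        num = sum(sums[(c, g)] for g in fold_ids if g != f)
        den = sum(counts[(c, g)] for g in fold_ids if g != f)
        encoded.append(num / den if den else None)
    return encoded

# Same toy column and target as above; the fold ids are an assumption.
x_0 = ['a'] * 5 + ['b'] * 5
y_0 = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
folds = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

encoded = kfold_target_mean(x_0, y_0, folds)
print(encoded)
```

Each row's encoding never uses its own fold's targets, which is the point of the kfold holdout.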
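Thanks for letting us know. The assertion was a precaution, because I was not sure whether the order could change. The rest of the code was written with that possibility in mind, so it is safe to use with a changed order anyway, but the assertion was forgotten. I have added a test and removed the assertion. The issue is now fixed and merged, and should be available in the upcoming fix release. 0xdata.atlassian.net/browse/PUBDEV-6474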
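To illustrate why an assertion of that shape is fragile, here is a small sketch in plain Python. The two lists are hypothetical stand-ins for the column names passed to the encoder and the keys that come back; I am assuming, based on the comment above, that the two can arrive in a different order.

```python
# Hypothetical values: the order requested vs. the order reported back.
te_columns = ["x_0", "x_1", "x_2"]
map_keys = ["x_1", "x_0", "x_2"]

def same_order(a, b):
    """Strict check: same names in the same positions (what == on lists does)."""
    return list(a) == list(b)

def same_columns(a, b):
    """Relaxed check: same names, order ignored."""
    return sorted(a) == sorted(b)

print(same_order(map_keys, te_columns))    # False
print(same_columns(map_keys, te_columns))  # True
```

With a single column the two checks always agree, which is consistent with the one-column-per-encoder loop avoiding the error. Note that the actual fix described above removed the assertion rather than relaxing it.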