Unexpected exception when combining random forest trees
Using the information described in this question, I am trying to combine several random forest classifiers into a single classifier with python2.7.10 and sklearn 0.16.1, but in some cases I get this exception:
Traceback (most recent call last):
  File "sktest.py", line 50, in <module>
    predict(rf)
  File "sktest.py", line 46, in predict
    Y = rf.predict(X)
  File "/python-2.7.10/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 462, in predict
    proba = self.predict_proba(X)
  File "/python-2.7.10/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 520, in predict_proba
    proba += all_proba[j]
ValueError: non-broadcastable output operand with shape (39,1) doesn't match the broadcast shape (39,2)
The application creates a number of random forest classifiers on multiple processors and combines those objects into a single classifier that all processors can use.

The test code that produces this exception is shown below; it builds 5 classifiers from random arrays with 10 features. If yfrac is changed to 0.5, the code raises no exception. Is this a valid way to combine classifier objects? Also, the same exception is raised when using warm_start to add trees to an existing RandomForestClassifier, increasing n_estimators and fitting on additional data (a sketch of that variant follows the script below).
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from numpy import zeros, random, logical_or, where

random.seed(1)

def generate_rf(X_train, y_train, X_test, y_test, numTrees=50):
    rf = RandomForestClassifier(n_estimators=numTrees, n_jobs=-1)
    rf.fit(X_train, y_train)
    print "rf score ", rf.score(X_test, y_test)
    return rf

def combine_rfs(rf_a, rf_b):
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

def make_data(ndata, yfrac=0.5):
    nx = int(random.uniform(10, 100))
    X = zeros((nx, ndata))
    Y = zeros(nx)
    for n in range(ndata):
        rnA = random.random() * 10**(random.random() * 5)
        X[:, n] = random.uniform(-rnA, rnA, nx)
        Y = logical_or(Y, where(X[:, n] > yfrac * rnA, 1., 0.))
    return X, Y

def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
    rfs = []
    for u in range(ntrain):
        X, Y = make_data(ndata, yfrac=yfrac)
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_frac)
        # Train the random forest and add to list
        rfs.append(generate_rf(X_train, Y_train, X_test, Y_test))
    # Combine the block classifiers into a single classifier
    return reduce(combine_rfs, rfs)

def predict(rf, ndata=10):
    X, Y = make_data(ndata)
    Y = rf.predict(X)

if __name__ == "__main__":
    rf = train(yfrac=0.42)
    predict(rf)
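For reference, a minimal sketch of the warm_start variant mentioned above; X_first/Y_first and X_second/Y_second are hypothetical stand-ins for two data blocks (e.g. from make_data), not names from the script:

rf = RandomForestClassifier(n_estimators=50, warm_start=True, n_jobs=-1)
rf.fit(X_first, Y_first)
# Request 50 more trees, then grow the existing forest on the new block
rf.n_estimators += 50
rf.fit(X_second, Y_second)
# If the second block happens to contain only one class, the forest's
# classes_ are redefined on the second fit, the old and new trees disagree
# on predict_proba's column count, and predict raises the same ValueError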
Your first RandomForest gets only positive cases, while the other RandomForests get both kinds; as a result, their DecisionTrees produce outputs that are incompatible with one another. You can confirm this by running your code with the train() function replaced by:
def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
    rfs = []
    for u in range(ntrain):
        X, Y = make_data(ndata, yfrac=yfrac)
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_frac)
        # Fail fast if a training block contains only one class
        assert Y_train.sum() != 0
        assert Y_train.sum() != len(Y_train)
        # Train the random forest and add to list
        rfs.append(generate_rf(X_train, Y_train, X_test, Y_test))
    # Combine the block classifiers into a single classifier
    return reduce(combine_rfs, rfs)
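The incompatibility is visible directly on the fitted forests: a forest fit on a single class reports n_classes_ == 1 and its predict_proba output has one column, which cannot be broadcast against the two-column output of a forest that saw both classes. A minimal standalone sketch (the toy data here is illustrative, not from the question):

from sklearn.ensemble import RandomForestClassifier
from numpy import ones, arange, random

X = random.uniform(-1., 1., (20, 3))
rf_one = RandomForestClassifier(n_estimators=5).fit(X, ones(20))        # only class 1
rf_two = RandomForestClassifier(n_estimators=5).fit(X, arange(20) % 2)  # classes 0 and 1

print rf_one.n_classes_, rf_one.predict_proba(X).shape  # 1 (20, 1)
print rf_two.n_classes_, rf_two.predict_proba(X).shape  # 2 (20, 2)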
Use a StratifiedShuffleSplit cross-validation generator instead of train_test_split, and check to make sure that each RF sees all of the classes in its training set.
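For example, a minimal sketch of the train() loop using the sklearn 0.16 cross_validation API, where StratifiedShuffleSplit takes the labels up front and yields (train, test) index pairs; it also raises early if Y contains only one class:

from sklearn.cross_validation import StratifiedShuffleSplit

def train(ntrain=5, ndata=10, test_frac=0.2, yfrac=0.5):
    rfs = []
    for u in range(ntrain):
        X, Y = make_data(ndata, yfrac=yfrac)
        # A stratified split preserves the class ratio in both halves,
        # so each RF sees every class present in Y
        sss = StratifiedShuffleSplit(Y, n_iter=1, test_size=test_frac)
        for train_idx, test_idx in sss:
            rfs.append(generate_rf(X[train_idx], Y[train_idx],
                                   X[test_idx], Y[test_idx]))
    return reduce(combine_rfs, rfs)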