Is there a way to ensemble multiple logistic regression equations into one?
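I am working on a binary classification problem where the response rate (bads) is below 1%. The predictors are a mix of nominal categorical variables and continuous variables.
Initially, I tried an oversampling technique (SMOTE) to balance the two classes. Logistic regression on the oversampled dataset gives good overall accuracy, but the false positive rate is very high.
I am now planning to undersample instead and run multiple logistic regression models. The basic Python code I am working with is below. I need guidance on combining the results of these logistic regression models into one.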
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# The range sets the number of equations (models) required
for i in range(10):
    # Create a sample of goods; good is a pandas DataFrame containing the goods
    sample_good = good.sample(n=300, replace=True)
    # Create a sample of bads; bad is a pandas DataFrame containing the bads.
    # There are only 100 bads in the dataset
    sample_bad = bad.sample(n=100, replace=True)
    # Combine the good and bad samples (DataFrame.append was removed in pandas 2.0)
    sample = pd.concat([sample_good, sample_bad])
    X = sample.drop(columns='y')
    y = sample['y']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)
    print('Accuracy of logistic regression classifier on test set: '
          '{:.2f}'.format(logreg.score(X_test, y_test)))
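The for loop above runs 10 times and builds 10 different models. I need guidance on ensembling these 10 models into a single model. I have read about available techniques such as bagging. In this case, because the response rate is so low, every sample I create needs to contain all of the bads.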
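I think you should use scikit-learn's BaggingClassifier. In short, it fits several classifiers on random subsamples of the data and then lets them vote to perform the classification. This meta-estimator elegantly spares you from writing the for loop. As for the sampling (which I believe was your original motivation for writing the loop), you can adjust the weights in the model.fit() method.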
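To make the "voting" concrete, here is a minimal sketch on synthetic data (the dataset and all parameter values are assumptions chosen for illustration, not part of the question). When the base estimator exposes predict_proba, as LogisticRegression does, BaggingClassifier combines its members by averaging their predicted class probabilities, and the sketch reproduces that average by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic, mildly imbalanced data (an assumption made for illustration).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

# Ten logistic regressions, each fitted on a bootstrap sample of the rows.
bag = BaggingClassifier(LogisticRegression(solver="liblinear"),
                        n_estimators=10, random_state=0)
bag.fit(X, y)

# Average the class probabilities of the fitted members by hand.
manual_proba = np.mean([est.predict_proba(X) for est in bag.estimators_], axis=0)

# The ensemble's own probabilities are the same average.
ensemble_proba = bag.predict_proba(X)
print(np.allclose(manual_proba, ensemble_proba))
```

The fitted members stay available in bag.estimators_, so each individual equation (coef_ and intercept_) can still be inspected after fitting.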
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X,y)
As you can see, the dataset is imbalanced (it is medical data, after all):
len(y_train[y_train == 0]),len(y_train[y_train == 1]) # 163, 263
So let's add sample weights:
y0 = len(y_train[y_train == 0])
y1 = len(y_train[y_train == 1])
w0 = y1/y0
w1 = 1
sample_weights = np.zeros(len(y_train))
sample_weights[y_train == 0] = w0
sample_weights[y_train == 1] = w1
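As a quick sanity check of this weighting scheme, the sketch below rebuilds the weights on hypothetical labels that have the class counts quoted above (163 and 263) and confirms that both classes end up with the same total weight. It also compares the result with scikit-learn's compute_sample_weight("balanced", ...), which is not used in the answer but yields the same weights up to a constant factor.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical labels with the class counts quoted above (163 zeros, 263 ones).
y_train = np.array([0] * 163 + [1] * 263)

n0 = np.sum(y_train == 0)
n1 = np.sum(y_train == 1)

# Same scheme as above: upweight the minority class by the ratio n1 / n0.
sample_weights = np.where(y_train == 0, n1 / n0, 1.0)

# Total weight carried by each class after reweighting.
total_w0 = sample_weights[y_train == 0].sum()
total_w1 = sample_weights[y_train == 1].sum()
print(total_w0, total_w1)

# scikit-learn's "balanced" heuristic gives proportional weights.
balanced = compute_sample_weight("balanced", y_train)
ratio = balanced / sample_weights
```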
Now the BaggingClassifier:
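model = BaggingClassifier(LogisticRegression(solver='liblinear'),
                          n_estimators=10,
                          bootstrap=True, random_state=2019)
# Fit on the training split: the weights were built from y_train
model.fit(X_train, y_train, sample_weight=sample_weights)
balanced_accuracy_score(y_test, model.predict(X_test))  # 94.2% as reported; varies with the random split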
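Note that if I do not fit with the sample weights, I only get 92.1% balanced accuracy (balanced accuracy = average recall, which is very handy for imbalanced problems).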
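If every sample really must contain all of the bads, as the question requires, one possible alternative is to keep the loop and average the predicted probabilities of the fitted models. The following is only a sketch under stated assumptions: the data are synthetic, the 3:1 good-to-bad ratio mirrors the 300/100 split in the question, and the 0.5 threshold and the helper name ensemble_proba are hypothetical choices. Because the goods are undersampled, the averaged probabilities are shifted upward relative to the true bad rate, so the threshold would need tuning on real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): roughly 5% bads (class 1).
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
bad_idx = np.flatnonzero(y_train == 1)
good_idx = np.flatnonzero(y_train == 0)

models = []
for _ in range(10):
    # Keep every bad; draw three goods per bad without replacement.
    sampled_good = rng.choice(good_idx, size=3 * len(bad_idx), replace=False)
    idx = np.concatenate([bad_idx, sampled_good])
    models.append(LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx]))

def ensemble_proba(X_new):
    """Average the predicted probability of the bad class across all models."""
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)

proba = ensemble_proba(X_test)
y_pred = (proba >= 0.5).astype(int)  # threshold is an assumption, tune it

score = balanced_accuracy_score(y_test, y_pred)
# Balanced accuracy is the unweighted mean of the per-class recalls.
macro_recall = recall_score(y_test, y_pred, average="macro")
print(round(score, 3), round(macro_recall, 3))
```

Averaging probabilities keeps ten separate equations rather than producing a single set of coefficients; it is the same combination rule that BaggingClassifier applies, with the sampling done by hand.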