Calculating AUC for LogisticRegression model
Let's get the data:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
data = load_breast_cancer()
X = data.data
y = data.target
I want to build a model using only the first principal component and calculate the AUC for it.
My work so far:
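# Standardize the features (there is no train/test split here, so fit on X)
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
# Keep only the first principal component of the scaled data
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1'])
clf = LogisticRegression()
clf = clf.fit(principalDf, y)
# predict_proba returns an array of shape (n_samples, 2)
pred = clf.predict_proba(principalDf)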
But when I try to use
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
I get the following error:
y should be a 1d array, got an array of shape (569, 2) instead.
I tried reshaping my data
fpr, tpr, thresholds = metrics.roc_curve(y.reshape(1,-1), pred, pos_label=2)
but that did not solve the problem (output):
multilabel-indicator format is not supported
Do you know how to compute the AUC on the first principal component?
You could try:
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Scale, project onto the first principal component, then classify
scaler = StandardScaler()
pca = PCA(n_components=1)
clf = LogisticRegression()
ppl = Pipeline([("scaler", scaler), ("pca", pca), ("clf", clf)])
ppl.fit(X_train, y_train)

# roc_curve needs one score per sample: the probability of class 1,
# not the hard labels returned by predict()
preds = ppl.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, preds, pos_label=1)

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
metrics.RocCurveDisplay.from_estimator(ppl, X_test, y_test)
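The snippet above produces the ROC curve but stops short of the AUC value the question asks for. Here is a minimal sketch of how that number could be obtained, assuming a pipeline like the one above restricted to one principal component. The `random_state=0` and `stratify=y` arguments and the variable names `scores`, `auc_from_curve` and `auc_direct` are illustrative choices, not part of the original answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# random_state and stratify are assumptions added for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Scale -> first principal component -> logistic regression
ppl = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=1)),
    ("clf", LogisticRegression()),
])
ppl.fit(X_train, y_train)

# One score per test sample: the predicted probability of class 1
scores = ppl.predict_proba(X_test)[:, 1]

# Route 1: integrate the ROC curve with the trapezoidal rule
fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label=1)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the AUC directly from labels and scores
auc_direct = roc_auc_score(y_test, scores)

print(auc_from_curve, auc_direct)
```

Both routes should agree; `roc_auc_score` is the shorter option when the curve itself is not needed.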
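The problem is that predict_proba returns one column per class. This is binary classification and your classes are 0 and 1, so what you want is the probability of the second class (class 1). The usual approach is to slice it out as follows (replace the last line of your code block):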
pred = clf.predict_proba(principalDf)[:, 1]
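To make the shape issue concrete, the following sketch rebuilds the question's setup and inspects what predict_proba returns before and after slicing. It assumes the scaled data is what goes into PCA, and it looks up the column for class 1 through `clf.classes_` rather than hard-coding the index. The names `pc1`, `pos_col` and `train_auc` are hypothetical. Note that this AUC is measured on the same data the model was fitted on, so it is likely optimistic.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize, then keep only the first principal component
X_scaled = StandardScaler().fit_transform(X)
pc1 = PCA(n_components=1).fit_transform(X_scaled)

clf = LogisticRegression().fit(pc1, y)

# One column per class: column i holds P(class == clf.classes_[i])
proba = clf.predict_proba(pc1)
print(proba.shape)   # (569, 2)
print(clf.classes_)  # [0 1]

# Find the column that belongs to the positive class (label 1)
pos_col = list(clf.classes_).index(1)
pred = proba[:, pos_col]  # 1-D array, the shape roc_curve expects

# AUC on the training data (no held-out split here)
train_auc = roc_auc_score(y, pred)
print(train_auc)
```

Looking the column up through `classes_` gives the same result as `[:, 1]` here, but it keeps working if the labels are not 0 and 1.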