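Collate model coefficients across multiple test-train splits from sklearn
I want to combine the model/feature coefficients from multiple (random) test-train splits into a single dataframe in Python.
Currently, my approach is to generate the model coefficients for one test-train split at a time and then combine them at the end of the code.
While this works, it is overly verbose and does not scale to a very large number of test-train splits.
Can someone simplify my approach with a simple for loop? My inelegant, overly verbose code follows: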
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
####Instantiate logistic regression objects
log = LogisticRegression(class_weight='balanced', random_state = 1)
#### import some data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100,]
#####test_train split #1
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=11)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final1 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final1.columns=("features", "coefficients_1")
######test_train split #2
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=444)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final2 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final2.columns=("features", "coefficients_2")
#####test_train split #3
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=21)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final3 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final3.columns=("features", "coefficients_3")
#####test_train split #4
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=109)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final4 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final4.columns=("features", "coefficients_4")
#####test_train split #5
train_x, test_x, train_y, test_y = train_test_split(X,y, stratify=y, test_size=0.3, random_state=1900)
log.fit(train_x, train_y) #fit final model
pred_y = log.predict(test_x) #store final model predictions
probs_y = log.predict_proba(test_x) #final model class probabilities
coeff_final5 = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(log.coef_))], axis = 1)
coeff_final5.columns=("features", "coefficients_5")
#######Append features/coefficients & odds ratios across 5 test-train splits
#append all coefficients into a single dataframe
coeff_table = pd.concat([coeff_final1, coeff_final2["coefficients_2"], coeff_final3["coefficients_3"],coeff_final4["coefficients_4"], coeff_final5["coefficients_5"] ], axis = 1)
#append mean and std error for each coefficient (numeric coefficient columns only)
coef_cols = ["coefficients_1", "coefficients_2", "coefficients_3", "coefficients_4", "coefficients_5"]
coeff_table["mean_coeff"] = coeff_table[coef_cols].mean(axis = 1)
coeff_table["se_coeff"] = coeff_table[coef_cols].sem(axis=1)
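The final table looks like this:
Can someone show me how to generate the table above without writing out all the lines of code from test-train split #2 through test-train split #5?
As you mentioned, you can do this with a for loop: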
# start by creating the first features column
coeff_table = pd.DataFrame(X.columns, columns=["features"])
# iterate over random states while keeping track of `i`
for i, state in enumerate([11, 444, 21, 109, 1900]):
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=state)
    log.fit(train_x, train_y) #fit final model
    coeff_table[f"coefficients_{i+1}"] = np.transpose(log.coef_)
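Note that we drop the predict and predict_proba calls in this loop, because those values are discarded anyway (they are overwritten on every split in your code). You can add them back with similar logic in the loop that creates new columns in the table.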
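If you also want the summary columns and the per-split predictions, the loop can be wrapped in a small helper. This is only a minimal sketch under a few assumptions: the function name `collate_splits`, the `predictions` dictionary, and the column names `true_y`, `pred_y`, and `prob_class_1` are illustrative and do not come from the question. It also assumes a binary target, so that `coef_` has a single row.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def collate_splits(X, y, random_states, test_size=0.3):
    """Fit one model per split; return (coefficient table, per-split predictions).

    Hypothetical helper: the name and return layout are illustrative only.
    """
    model = LogisticRegression(class_weight="balanced", random_state=1)
    coeff_table = pd.DataFrame({"features": X.columns})
    predictions = {}  # split number -> DataFrame of test-set predictions

    for i, state in enumerate(random_states, start=1):
        train_x, test_x, train_y, test_y = train_test_split(
            X, y, stratify=y, test_size=test_size, random_state=state)
        model.fit(train_x, train_y)

        # For a binary problem coef_ has shape (1, n_features); take its only row.
        coeff_table[f"coefficients_{i}"] = model.coef_[0]

        # Keep the predictions instead of overwriting them on every split.
        predictions[i] = pd.DataFrame({
            "true_y": test_y,
            "pred_y": model.predict(test_x),
            "prob_class_1": model.predict_proba(test_x)[:, 1],
        }, index=test_x.index)

    # Summarise over the coefficient columns only, not the string "features" column.
    coef_cols = [c for c in coeff_table.columns if c.startswith("coefficients_")]
    coeff_table["mean_coeff"] = coeff_table[coef_cols].mean(axis=1)
    coeff_table["se_coeff"] = coeff_table[coef_cols].sem(axis=1)
    return coeff_table, predictions


# Same data as in the question: the first 100 iris rows (two classes).
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:100, :],
                 columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target[:100]

coeff_table, predictions = collate_splits(X, y, [11, 444, 21, 109, 1900])
print(coeff_table)
```

The predictions are stored in a dictionary rather than in `coeff_table` because they have one row per test sample, whereas the coefficient table has one row per feature. Scaling to more splits only requires passing a longer list of random states, for example `range(100)`.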