How to predict a subclass from a cluster of an already predicted class in a hierarchical manner in Python
Suppose I have the following DataFrame:
Student_Id Math Physical Arts Class Sub_Class
0 id_1 6 7 9 A x
1 id_2 9 7 1 A y
2 id_3 3 5 5 C x
3 id_4 6 8 9 A x
4 id_5 6 7 10 B z
5 id_6 9 5 10 B z
6 id_7 3 5 6 C x
7 id_8 3 4 6 C x
8 id_9 6 8 9 A x
9 id_10 6 7 10 B z
10 id_11 9 5 10 B z
11 id_12 3 5 6 C x
12 id_13 3 4 6 C x
I want to use RandomForestClassifier to first train with Class as the target variable and predict Class on the test dataset.
Student_Id Class Sub_Class predicted_class
11 id_12 C x C
8 id_9 A x A
3 id_4 A x A
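Then it should take each predicted_class in the test dataset, train only on the training rows belonging to that class, predict Sub_Class, and append the results group by group.
1) First it takes class 'C', trains only on Class 'C', and predicts the sub-class
Student_Id Class Sub_Class predicted_class predicted_Sub_Class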
11 id_12 C x C x
2) Next it takes class 'A', trains only on Class 'A', and predicts the sub-class
Student_Id Class Sub_Class predicted_class predicted_Sub_Class
8 id_9 A x A x
3 id_4 A x A y
3) Finally it combines all the groups
Student_Id Class Sub_Class predicted_class predicted_Sub_Class
11 id_12 C x C x
8 id_9 A x A x
3 id_4 A x A y
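Once the per-class results are stacked into one table like the one above, both levels can be scored from that table. The following is a minimal sketch that copies the illustrative rows from the example (they are the desired output shape, not real model output) and uses `accuracy_score`, which the question already imports. The names `combined`, `class_acc`, `sub_acc` and `both_right` are made up for this illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Illustrative combined table copied from the example above (not real model output)
combined = pd.DataFrame(
    {
        'Student_Id': ['id_12', 'id_9', 'id_4'],
        'Class': ['C', 'A', 'A'],
        'Sub_Class': ['x', 'x', 'x'],
        'predicted_class': ['C', 'A', 'A'],
        'predicted_Sub_Class': ['x', 'x', 'y'],
    },
    index=[11, 8, 3],
)

# Accuracy at each level of the hierarchy
class_acc = accuracy_score(combined['Class'], combined['predicted_class'])
sub_acc = accuracy_score(combined['Sub_Class'], combined['predicted_Sub_Class'])

# Fraction of rows where both the class and the sub-class are correct
both_right = (
    (combined['Class'] == combined['predicted_class'])
    & (combined['Sub_Class'] == combined['predicted_Sub_Class'])
).mean()

print(class_acc, sub_acc, both_right)
```

Scoring the joint correctness separately is useful here because a wrong class in the first stage sends the row to the wrong sub-class model in the second stage.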
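To summarize, I don't want to train and predict Class and Sub_Class independently. I want to predict Class first, then use that prediction to train a model per class, treating each class as a cluster, and predict 'Sub_Class', because I think this will give better results.
I can't work out the second part: how to loop over each class and train a model to get the Sub_Class.
Current sample code, without the second part: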
import pandas as pd
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Create dataframe
data = [
["id_1",6,7,9, "A", "x"],
["id_2",9,7,1, "A","y" ],
["id_3",3,5,5, "C", "x"],
["id_4",6,8,9, "A","x" ],
["id_5",6,7,10, "B", "z"],
["id_6",9,5,10,"B", "z"],
["id_7",3,5,6, "C", "x"],
["id_8",3,4,6, "C", "x"],
["id_9",6,8,9, "A","x" ],
["id_10",6,7,10, "B", "z"],
["id_11",9,5,10,"B", "z"],
["id_12",3,5,6, "C", "x"],
["id_13",3,4,6, "C", "x"]
]
df = pd.DataFrame(data, columns = ['Student_Id', 'Math', 'Physical','Arts', 'Class', 'Sub_Class'])
# Split into test and train
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)
# First predict (classify) the Class--------------------------------------------
# Create train data
X_train = training_data[['Math', 'Physical','Arts']]
y_train = training_data['Class']  # 1-D Series, so sklearn does not warn about a column vector
# Create test data
X_test = testing_data[['Math', 'Physical','Arts']]
y_test = testing_data['Class']
# Random Forest classifier for predicting class
rfc = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
predictions = rfc.predict(X_test)
rfc_table = testing_data[['Student_Id', 'Class', 'Sub_Class']]
rfc_table = rfc_table.assign(predicted_class=predictions)
# Next train for Sub_Class------------------------------------------------------
You can do it like this:
# Train one Sub_Class model per Class; writes the predictions into df and returns the models
def train_sub(df):
    # A model dictionary to return the trained models
    models = {}
    # Select all the unique classes in df and iterate over them
    for i in df['Class'].unique():
        # Choose the index labels of df where the class is equal to i
        temp_idx = df[df['Class'] == i].index
        train_idx, test_idx = train_test_split(temp_idx, test_size=0.2, random_state=25)
        X_train = df.loc[train_idx, ['Math', 'Physical','Arts']]
        y_train = df.loc[train_idx, 'Sub_Class']
        # Held-out rows of this class, available for evaluation (not used below)
        X_test = df.loc[test_idx, ['Math', 'Physical','Arts']]
        y_test = df.loc[test_idx, 'Sub_Class']
        # Train the model to classify sub-class under that class
        temp_model = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
        # Write the predictions for every row of this class back into df
        df.loc[temp_idx, 'Predicted_subClass'] = temp_model.predict(df.loc[temp_idx, ['Math', 'Physical','Arts']])
        # Add the model to the dictionary
        models[i] = temp_model
    return models
# Call the function
models = train_sub(df)
# See the results
df
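Note that `train_sub` groups rows by the true `Class` column. If the test rows should instead be routed by the class predicted in the first stage, as the question describes, something like the following might work. This is only a sketch under assumptions: `FEATURES`, `class_model`, `sub_models`, `result` and `final` are names invented here, `random_state=0` is added just for repeatability, and the per-class models are trained on the training split only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURES = ['Math', 'Physical', 'Arts']

data = [
    ["id_1", 6, 7, 9, "A", "x"], ["id_2", 9, 7, 1, "A", "y"],
    ["id_3", 3, 5, 5, "C", "x"], ["id_4", 6, 8, 9, "A", "x"],
    ["id_5", 6, 7, 10, "B", "z"], ["id_6", 9, 5, 10, "B", "z"],
    ["id_7", 3, 5, 6, "C", "x"], ["id_8", 3, 4, 6, "C", "x"],
    ["id_9", 6, 8, 9, "A", "x"], ["id_10", 6, 7, 10, "B", "z"],
    ["id_11", 9, 5, 10, "B", "z"], ["id_12", 3, 5, 6, "C", "x"],
    ["id_13", 3, 4, 6, "C", "x"],
]
df = pd.DataFrame(data, columns=['Student_Id'] + FEATURES + ['Class', 'Sub_Class'])
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)

# Stage 1: predict Class for the test rows
class_model = RandomForestClassifier(n_estimators=50, random_state=0)
class_model.fit(training_data[FEATURES], training_data['Class'])
result = testing_data[['Student_Id', 'Class', 'Sub_Class']].copy()
result['predicted_class'] = class_model.predict(testing_data[FEATURES])

# Stage 2: one Sub_Class model per class, trained only on training rows of that class
sub_models = {}
for cls, group in training_data.groupby('Class'):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    sub_models[cls] = model.fit(group[FEATURES], group['Sub_Class'])

# Route each test row to the model of its *predicted* class, then stack the groups
parts = []
for cls, group in result.groupby('predicted_class'):
    part = group.copy()
    X_group = testing_data.loc[group.index, FEATURES]
    part['predicted_Sub_Class'] = sub_models[cls].predict(X_group)
    parts.append(part)

# Restore the original test-row order
final = pd.concat(parts).loc[result.index]
print(final)
```

Because the first-stage model can only output classes it saw during training, every predicted class has a matching entry in `sub_models`. On a dataset this small the held-out rows are too few to judge whether the hierarchy actually helps.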