How to predict a subclass within each already-predicted class, hierarchically, in Python

Suppose I have the following DataFrame:

     Student_Id  Math  Physical  Arts Class Sub_Class
0        id_1     6         7     9     A         x
1        id_2     9         7     1     A         y
2        id_3     3         5     5     C         x
3        id_4     6         8     9     A         x
4        id_5     6         7    10     B         z
5        id_6     9         5    10     B         z
6        id_7     3         5     6     C         x
7        id_8     3         4     6     C         x
8        id_9     6         8     9     A         x
9       id_10     6         7    10     B         z
10      id_11     9         5    10     B         z
11      id_12     3         5     6     C         x
12      id_13     3         4     6     C         x

I want to use a RandomForestClassifier to first train with Class as the target variable and predict Class on the test dataset.

    Student_Id Class Sub_Class predicted_class
11      id_12     C         x               C
8        id_9     A         x               A
3        id_4     A         x               A

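Next, for each predicted_class in the test dataset, it should train only on the training rows that belong to that class, predict Sub_Class, and append the results group by group.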

1) First it takes class 'C', trains only on the Class 'C' rows, and predicts the sub-class:

   Student_Id Class Sub_Class predicted_class predicted_Sub_Class
11      id_12     C         x               C                   x

2) Next it takes class 'A', trains only on the Class 'A' rows, and predicts the sub-class:

   Student_Id Class Sub_Class predicted_class predicted_Sub_Class
8        id_9     A         x               A                   x
3        id_4     A         x               A                   y

3) Finally it combines all the groups:

   Student_Id Class Sub_Class predicted_class predicted_Sub_Class
11      id_12     C         x               C                   x
8        id_9     A         x               A                   x
3        id_4     A         x               A                   y

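In summary, I don't want to train and predict Class and Sub_Class independently. I want to predict Class first, then use that prediction to train one model per class (treating each class as a cluster) and predict Sub_Class, because I think this will give better results.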

I can't figure out how to do the second part: looping over each class and training a model to get Sub_Class.

Sample code so far, without the second part:


import pandas as pd

from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Create dataframe
data = [
    ["id_1",6,7,9, "A", "x"],
    ["id_2",9,7,1, "A","y" ],
    ["id_3",3,5,5, "C", "x"],
    ["id_4",6,8,9, "A","x" ],
    ["id_5",6,7,10, "B", "z"],
    ["id_6",9,5,10,"B", "z"],
    ["id_7",3,5,6, "C", "x"],
    ["id_8",3,4,6, "C", "x"],
    ["id_9",6,8,9, "A","x" ],
    ["id_10",6,7,10, "B", "z"],
    ["id_11",9,5,10,"B", "z"],
    ["id_12",3,5,6, "C", "x"],
    ["id_13",3,4,6, "C", "x"]
]



df = pd.DataFrame(data, columns = ['Student_Id', 'Math', 'Physical','Arts', 'Class', 'Sub_Class'])


#Split into test and train
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)


# First predict(classify) the Class--------------------------------------------

#Create train data
X_train = training_data[['Math', 'Physical','Arts']]

# 1-D target (a Series) avoids scikit-learn's DataConversionWarning
y_train = training_data['Class']

#Create test
X_test = testing_data[['Math', 'Physical','Arts']]

y_test = testing_data['Class']

#Random Forest classifier for predicting class
rfc = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
 
predictions = rfc.predict(X_test)

rfc_table = testing_data[['Student_Id', 'Class', 'Sub_Class']]
rfc_table = rfc_table.assign(predicted_class=predictions)

#Next train for Sub_Class------------------------------------------------------

You can do it like this:

# Train one sub-class model per class and write the predictions back into df
def train_sub(df):
    # Trained models, keyed by class label
    models = {}

    # Iterate over every distinct class in df
    for i in df['Class'].unique():

        # Index labels of the rows that belong to class i
        temp_idx = df[df['Class'] == i].index
        # Hold out 20% of those rows; only train_idx is used for fitting
        train_idx, test_idx = train_test_split(temp_idx, test_size=0.2, random_state=25)

        X_train = df.loc[train_idx, ['Math', 'Physical','Arts']]
        # 1-D target avoids scikit-learn's DataConversionWarning
        y_train = df.loc[train_idx, 'Sub_Class']

        # Train the model that classifies sub-classes within class i
        temp_model = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
        # Predict Sub_Class for every row of class i (training and held-out rows alike)
        df.loc[temp_idx, 'Predicted_subClass'] = temp_model.predict(df.loc[temp_idx, ['Math', 'Physical','Arts']])
        # Store the model under its class label
        models[i] = temp_model
    return models

# call the functions
models = train_sub(df)

# See the results
df
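
Note that the function above groups rows by the true Class column. To follow the three steps from the question more closely (route each test row by its predicted class, then predict Sub_Class with that class's model), something along these lines might work. This is only a minimal sketch, not a definitive implementation: the names FEATURES, class_model, sub_models and result are made up for illustration, the fixed random_state=0 on the forests is an added assumption for reproducibility, and the sub-class models are fitted on the training rows only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURES = ["Math", "Physical", "Arts"]  # hypothetical helper name

data = [
    ["id_1", 6, 7, 9, "A", "x"], ["id_2", 9, 7, 1, "A", "y"],
    ["id_3", 3, 5, 5, "C", "x"], ["id_4", 6, 8, 9, "A", "x"],
    ["id_5", 6, 7, 10, "B", "z"], ["id_6", 9, 5, 10, "B", "z"],
    ["id_7", 3, 5, 6, "C", "x"], ["id_8", 3, 4, 6, "C", "x"],
    ["id_9", 6, 8, 9, "A", "x"], ["id_10", 6, 7, 10, "B", "z"],
    ["id_11", 9, 5, 10, "B", "z"], ["id_12", 3, 5, 6, "C", "x"],
    ["id_13", 3, 4, 6, "C", "x"],
]
df = pd.DataFrame(data, columns=["Student_Id", *FEATURES, "Class", "Sub_Class"])

# Same split as in the question
train, test = train_test_split(df, test_size=0.2, random_state=25)

# Stage 1: one model that predicts Class
class_model = RandomForestClassifier(n_estimators=50, random_state=0)
class_model.fit(train[FEATURES], train["Class"])

# Stage 2: one Sub_Class model per class, fitted on that class's training rows only
sub_models = {}
for cls, group in train.groupby("Class"):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    sub_models[cls] = model.fit(group[FEATURES], group["Sub_Class"])

# Predict Class on the test rows, then route each row to its class's model
result = test[["Student_Id", "Class", "Sub_Class"]].copy()
result["predicted_class"] = class_model.predict(test[FEATURES])

pieces = []
for cls, rows in result.groupby("predicted_class"):
    preds = sub_models[cls].predict(test.loc[rows.index, FEATURES])
    pieces.append(pd.Series(preds, index=rows.index))

# Assignment aligns on the index, so the groups are recombined in place
result["predicted_Sub_Class"] = pd.concat(pieces)
print(result)
```

Because every label the class model can output comes from the training data, there is always a matching entry in sub_models. A wrong class prediction in stage 1 still propagates to stage 2, so it is worth comparing this against a single flat Sub_Class model on a larger dataset before assuming the hierarchy helps.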