How to predict multiple dependent columns from one independent column

Is it possible to predict multiple dependent columns from a single independent column?

Problem statement: I have to predict 5 factors (cEXT, cNEU, cAGR, cCON, cOPN) from the STATUS column, so the only input variable is STATUS and the target variables are cEXT, cNEU, cAGR, cCON, and cOPN.

In the data above, STATUS is the independent column, and cEXT, cNEU, cAGR, cCON, and cOPN are the dependent columns. How can I predict them?

# independent and dependent variable split
X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]

Right now I predict one column at a time and repeat the same steps 5 times, so I am creating 5 models for the 5 target variables.

Code:

X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)


from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

ct = ColumnTransformer([
    ('step1', TfidfVectorizer(), 'STATUS')
],remainder='drop')

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics 
from sklearn.pipeline import Pipeline

# ########## 
# RandomForest
# ##########
model = Pipeline([
        ('column_transformers', ct),
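        # max_features='sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases reject 'auto'
        ('model', RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'sqrt')),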
    ])

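# creating 5 models, can I create 1 model?
# clone() gives each target its own unfitted copy of the pipeline; calling
# model.fit() repeatedly refits one object, so all five names would end up
# pointing at the model fitted last (cOPN)
from sklearn.base import clone
model_cEXT = clone(model).fit(X_train, y_train['cEXT'])
model_cNEU = clone(model).fit(X_train, y_train['cNEU'])
model_cAGR = clone(model).fit(X_train, y_train['cAGR'])
model_cCON = clone(model).fit(X_train, y_train['cCON'])
model_cOPN = clone(model).fit(X_train, y_train['cOPN'])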

You can use the multi-output classifier from scikit-learn.

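from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# ct is the ColumnTransformer from the question; it vectorizes the raw STATUS text first
clf = make_pipeline(ct, MultiOutputClassifier(RandomForestClassifier())).fit(X_train, y_train)
clf.predict(X_test)  # array of shape (n_samples, 5), one column per target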

Reference: official documentation for MultiOutputClassifier
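For anyone who wants to try this without the original data, here is a minimal end-to-end sketch. The toy DataFrame, the two target columns, and the 0/1 label encoding are assumptions made up for illustration; the real data has five targets and its label encoding is not shown in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real DataFrame
df = pd.DataFrame({
    "STATUS": [
        "love meeting new people at parties",
        "staying home alone with a book",
        "worried about everything again",
        "calm and relaxed today",
        "party with friends tonight",
        "quiet evening, feeling anxious",
    ],
    "cEXT": [1, 0, 0, 0, 1, 0],  # assumed 0/1 encoding
    "cNEU": [0, 0, 1, 0, 0, 1],
})
targets = ["cEXT", "cNEU"]

# Turn the text column into TF-IDF features, as in the question
ct = ColumnTransformer([("tfidf", TfidfVectorizer(), "STATUS")], remainder="drop")

# MultiOutputClassifier fits one copy of the base estimator per target column
model = Pipeline([
    ("vectorize", ct),
    ("model", MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=0))),
])
model.fit(df[["STATUS"]], df[targets])

new_posts = pd.DataFrame({"STATUS": ["big party tonight", "feeling anxious and worried"]})
pred = pd.DataFrame(model.predict(new_posts), columns=targets)
print(pred)
```

Under the hood this still trains one forest per target, so it is mainly a convenience: a single object to fit, predict with, and pickle.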

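There is a library, scikit-multilearn, that is well suited to these tasks. There are several ways to do multi-label classification, such as label powerset, classifier chains, etc. They are all well covered in this library.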

Below is an example showing how it would replace your current code.

X = df[['STATUS']]
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)

# Rest of your code
# ==========================
# The new code

from skmultilearn.problem_transform import BinaryRelevance
from scipy.sparse import csr_matrix



classifier = BinaryRelevance(
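    # max_features='sqrt' is what 'auto' meant for classifiers; newer scikit-learn releases reject 'auto'
    classifier = RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'sqrt'),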
    require_dense = [False, True]
)

model = Pipeline([
        ('column_transformers', ct),
        ('classifier', classifier),
    ])

model.fit(X_train, y_train.values)
res = model.predict(X_test)
res = csr_matrix(res)
res.todense()

You can explore other methods here.
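As one example of another method, here is a rough sketch of a classifier chain, where each label's model also sees the labels earlier in the chain as extra features. To keep it dependency-light it uses scikit-learn's own `sklearn.multioutput.ClassifierChain` rather than the scikit-multilearn class, and the texts, labels, and chain order are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two binary labels per text (think cEXT, cNEU)
texts = [
    "love meeting new people at parties",
    "staying home alone with a book",
    "worried about everything again",
    "calm and relaxed today",
    "party with friends tonight",
    "quiet evening, feeling anxious",
]
Y = np.array([[1, 0], [0, 0], [0, 1], [0, 0], [1, 0], [0, 1]])

# order=[0, 1]: the model for label 1 gets label 0 as an additional input feature
chain = make_pipeline(
    TfidfVectorizer(),
    ClassifierChain(RandomForestClassifier(n_estimators=50, random_state=0), order=[0, 1]),
)
chain.fit(texts, Y)

pred = chain.predict(["big party tonight", "feeling anxious and worried"])
print(pred)  # one row per text, one column per label
```

Compared with binary relevance, a chain can pick up dependencies between labels, at the cost of results that depend on the chosen order.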

In TensorFlow, you can do this with a sigmoid activation on all output units and a binary cross-entropy loss, as follows:

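import tensorflow as tf
# Older TF releases expose this layer under tensorflow.keras.layers.experimental.preprocessing
from tensorflow.keras.layers import TextVectorization
from sklearn.model_selection import train_test_split

tfidf_calculator = TextVectorization(
                  standardize = 'lower_and_strip_punctuation',
                  split       = 'whitespace',
                  max_tokens  = 100,
                  output_mode = 'tf_idf',  # older TF releases spell this 'tf-idf'
                  pad_to_max_tokens=False)

# The column is named STATUS in the question's DataFrame
tfidf_calculator.adapt(df['STATUS'].values)

tfids = tfidf_calculator(df['STATUS'].values)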

X = tfids.numpy()
y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]].values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(100,)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(5, activation='sigmoid')
])

model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())

model.fit(X_train, y_train, epochs=20, batch_size=32)
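With sigmoid outputs, each of the 5 units gives an independent probability for its own label, so predictions still need to be turned into 0/1 decisions. A small NumPy sketch of that step, plus what the binary cross-entropy loss computes; the probability values and the 0.5 threshold are assumptions for illustration, not output from the model above.

```python
import numpy as np

targets = ["cEXT", "cNEU", "cAGR", "cCON", "cOPN"]

# Hypothetical sigmoid outputs, shaped like model.predict(X_test) for two rows
probs = np.array([
    [0.91, 0.12, 0.55, 0.40, 0.73],
    [0.08, 0.66, 0.49, 0.81, 0.30],
])

# Each label is decided on its own; 0.5 is an assumed cut-off and can be tuned per label
labels = (probs >= 0.5).astype(int)
for row in labels:
    print(dict(zip(targets, row.tolist())))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all rows and labels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# Hypothetical ground truth for the same two rows
y_true = np.array([[1, 0, 1, 0, 1], [0, 1, 1, 1, 0]])
bce = binary_cross_entropy(y_true, probs)
print(bce)
```

Note that this setup needs the targets encoded as 0/1 numbers before training.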

One thing to note with TensorFlow is that you need a dense matrix as input. There may be a way to use sparse input, but I have not found one.
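If the features come from scikit-learn's `TfidfVectorizer` (as in the question) rather than `TextVectorization`, the result is a SciPy sparse matrix. A minimal sketch of densifying it before handing it to Keras, assuming the matrix is small enough to fit in memory; the texts and the `max_features=100` cap are placeholders chosen to mirror the 100-token setting above.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for df['STATUS']
texts = [
    "love meeting new people at parties",
    "staying home alone with a book",
    "worried about everything again",
]

# fit_transform returns a SciPy sparse matrix
X_sparse = TfidfVectorizer(max_features=100).fit_transform(texts)
print(issparse(X_sparse))

# toarray() gives a plain NumPy array; float32 is the usual Keras input dtype
X_dense = X_sparse.toarray().astype("float32")
print(X_dense.shape, X_dense.dtype)
```

Capping the vocabulary keeps the dense array narrow, which matters because the dense form stores every zero explicitly.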