Making predictions using all labels in multilabel text classification

Question

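I'm currently working on a multilabel text classification problem in which I have 4 labels, represented as 4 dummy variables. I have tried several ways of transforming the data into a form suitable for multilabel classification (MLC).

Right now I'm running a pipeline, but as far as I can tell it does not fit one model that includes all labels. Instead it fits one model per label. Do you agree?

I have tried using MultiLabelBinarizer and LabelBinarizer, but with no luck.

Do you have any suggestions for how I can get a single model that includes all labels and takes the different label combinations into account?

A subset of the data and my code are here: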

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Import data
df  = import_data("product_data")
# Define dataframe to only include relevant columns
df = df.loc[:,['text','TV','Internet','Mobil','Fastnet']]
# Define dataframe with labels
df_labels = df.loc[:,['TV','Internet','Mobil','Fastnet']]
# Sum the number of labels per text
sum_column = df["TV"] + df["Internet"] + df["Mobil"] + df["Fastnet"]
df["label_sum"] = sum_column
# Remove texts with no labels
df.drop(df[df['label_sum'] == 0].index, inplace = True)
# Split dataset
train, test = train_test_split(df, random_state=42, test_size=0.2, shuffle=True)
X_train = train.text
X_test = test.text

categories = ['TV','Internet','Mobil','Fastnet']

# Model
LogReg_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
                ('clf', LogisticRegression(solver='lbfgs', multi_class = 'ovr', class_weight = 'balanced', n_jobs=-1)),
                 ])
    
for category in categories:
    print('... Processing {}'.format(category))
    LogReg_pipeline.fit(X_train, train[category])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

https://www.transfernow.net/dl/20210921NbWDt3eo

Answer 1

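Code analysis

The scikit-learn LogisticRegression classifier using OVR (one-vs-rest) can only predict a single output/label at a time. Since you are training the model in the pipeline on the labels one at a time, you end up with one trained model per label. The algorithm itself is the same for all models, but each one is trained on a different target.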
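To make the "one trained model per label" point concrete, here is a minimal sketch on toy data. The texts, the label values, and the names `base`, `models` and `pred_matrix` are made up for illustration and are not taken from the question's dataset. Cloning the pipeline for each label keeps every fitted model around instead of overwriting the previous fit.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the question's data (hypothetical texts and labels).
texts = [
    "tv signal is gone", "slow internet at home", "mobile plan upgrade",
    "tv and internet bundle", "internet router broken", "new mobile phone",
]
labels = {
    "TV":       np.array([1, 0, 0, 1, 0, 0]),
    "Internet": np.array([0, 1, 0, 1, 1, 0]),
    "Mobil":    np.array([0, 0, 1, 0, 0, 1]),
}

base = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced")),
])

# clone() gives each label its own unfitted copy, so earlier fits are not overwritten.
models = {name: clone(base).fit(texts, y) for name, y in labels.items()}

# Each fitted model predicts one binary label; stack the columns into a label matrix.
pred_matrix = np.column_stack([models[name].predict(texts) for name in labels])
print(pred_matrix.shape)
```

The result is three independent binary classifiers. None of them sees the other labels, so combinations of labels are not modelled.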

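Multi-output regressor

A multi-output regressor accepts several independent labels and generates one prediction per target.
The output should be the same as what you have now, but you only need to maintain a single model object and call fit once. Internally, the wrapper still fits one copy of the estimator per target.
To use this approach, wrap your LR model in a MultiOutputRegressor.
Here is a good tutorial on multi-output regression models.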

from sklearn.multioutput import MultiOutputRegressor

model = LogisticRegression(solver='lbfgs', multi_class='ovr', class_weight='balanced', n_jobs=-1)

pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(analyzer = 'word', max_df=0.20)),
                ('clf', MultiOutputRegressor(model))])

# Fit on the training rows only, so that the texts and the label matrix have the same length
preds = pipeline.fit(X_train, train[categories]).predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=categories)

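combine_data() merges all the data into a single DataFrame for convenience:

import pandas as pd

def combine_data(X, Y, y_cols):
    """ X is a DataFrame or Series, Y is a np array, y_cols is a list of column names """
    df_out = pd.DataFrame(Y, columns=y_cols)
    # Reuse X's index so that rows line up when concatenating
    df_out.index = X.index
    return pd.concat([X, df_out], axis=1).sort_index()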
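Because LogisticRegression is a classifier, scikit-learn's MultiOutputClassifier is a closely related wrapper that may fit this case better than MultiOutputRegressor. It also exposes per-label probabilities. The sketch below swaps in that wrapper, and it uses invented toy texts and a three-column label matrix rather than the question's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: one row per text, one 0/1 column per label.
texts = [
    "tv signal is gone", "slow internet at home", "mobile plan upgrade",
    "tv and internet bundle", "internet router broken", "new mobile phone",
]
categories = ["TV", "Internet", "Mobil"]
Y = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 1, 0], [0, 1, 0], [0, 0, 1],
])

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression(class_weight="balanced"))),
])
pipe.fit(texts, Y)  # a single fit call with the whole label matrix

preds = pipe.predict(texts)          # array of shape (n_samples, n_labels)
probas = pipe.predict_proba(texts)   # list with one (n_samples, 2) array per label
n_fitted = len(pipe.named_steps["clf"].estimators_)
print(preds.shape, n_fitted)
```

The `estimators_` attribute holds one fitted LogisticRegression per label, which shows that the wrapper is a convenience around per-label models rather than a joint model of label combinations.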

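Multinomial logistic regression

To use the LogisticRegression classifier on all labels at once, set multi_class='multinomial'.
The softmax function is then used to compute the predicted probability that a sample belongs to each class. Note that this is a multiclass setup: each sample is assigned exactly one class.
You will need to reverse the one-hot encoding of the labels to get back a single categorical variable (answer here). If you still have the original label from before the one-hot encoding, use that.
Here is a good tutorial on multinomial logistic regression.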

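# df_train, input_cols and "text_source" are placeholder names: input_cols are numeric
# (already vectorized) feature columns, "text_source" is a single categorical label column,
# and X_test holds the same feature columns for the test rows
label_col = ["text_source"]
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model = clf.fit(df_train[input_cols], df_train[label_col].values.ravel())

# Generate a table of probabilities, one column per class
probs = model.predict_proba(X_test)
df_probs = combine_data(X=X_test, Y=probs, y_cols=model.classes_)

# Predict the class for a sample, i.e. the one with the highest probability
preds = model.predict(X_test)
df_preds = combine_data(X=X_test, Y=preds, y_cols=label_col)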
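When a text can carry several labels at once, one way to "reverse" the dummy encoding is to treat every observed label combination as its own class (often called a label powerset). The following is only a sketch of that idea under stated assumptions: the toy DataFrame, the `combo` column and the `to_combo` helper are hypothetical, and `multi_class` is left at its default, which already uses the multinomial loss with the lbfgs solver for more than two classes.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

categories = ["TV", "Internet", "Mobil"]

# Hypothetical toy data in the same shape as the question's DataFrame.
df = pd.DataFrame({
    "text": [
        "tv signal is gone", "slow internet at home", "mobile plan upgrade",
        "tv and internet bundle", "internet router broken", "new mobile phone",
    ],
    "TV":       [1, 0, 0, 1, 0, 0],
    "Internet": [0, 1, 0, 1, 1, 0],
    "Mobil":    [0, 0, 1, 0, 0, 1],
})

def to_combo(row):
    """Collapse one row of 0/1 dummies into a class string such as 'TV+Internet'."""
    return "+".join(c for c in categories if row[c] == 1)

df["combo"] = df[categories].apply(to_combo, axis=1)

# One multiclass model whose classes are the observed label combinations.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs")),
]).fit(df["text"], df["combo"])

# Softmax probabilities, one column per label combination.
classes = model.named_steps["clf"].classes_
probs = pd.DataFrame(model.predict_proba(df["text"]), columns=classes, index=df.index)

# Decode the predicted combination back into one 0/1 column per label.
pred_combo = model.predict(df["text"])
pred_dummies = pd.DataFrame(
    {c: [int(c in s.split("+")) for s in pred_combo] for c in categories},
    index=df.index,
)
print(probs.shape, pred_dummies.shape)
```

The trade-off is that the model can only predict combinations it saw during training, and rare combinations have very few examples to learn from.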


python

scikit-learn

multilabel-classification