Is RandomOverSampler Causing my Model to Overfit?

Question

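I want to see whether I can use Tfidf vectorization to classify books by genre. I am working with five moderately imbalanced genre labels, and I want to use multilabel classification to assign one or more genres to each document. My initial performance was mediocre, so I tried to fix this by rebalancing the classes with RandomOverSampler, and my cross-validated f1_macro score jumped from 0.415 to 0.842.

I read here that combining resampling and cross-validation incorrectly can cause your model to overfit, so I want to make sure I am not doing that here.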

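import re
import nltk
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn import compose, multiclass, pipeline, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

# Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download).
# Build these once instead of once per word.
STOPWORDS = set(nltk.corpus.stopwords.words('english'))
LEMMATIZER = nltk.stem.WordNetLemmatizer()

def preprocess_text(text):
    try:
        # Keep letters only, lowercase, drop stopwords, then lemmatize.
        text = re.sub('[^a-zA-Z]', ' ', text)
        text = text.lower().split()
        text = [word for word in text if word not in STOPWORDS]
        text = [LEMMATIZER.lemmatize(word) for word in text if len(word) > 1]
        return ' '.join(text)
    except TypeError:
        # Non-string input (e.g. NaN) becomes an empty document.
        return ''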

def preprocess_series(series):
    # Iterate over the values directly so a non-default index cannot break lookups.
    return pd.Series([preprocess_text(text) for text in series])

books_data = pd.DataFrame([
    ["A_Likely_Story.txt", "fantasy fiction:science fiction", "If you discovered a fantastic power like thi..."],
    ["All_Cats_Are_Gray.txt", "science fiction", "An odd story, made up of oddly assorted elem..."]
    ],columns=["title", "genre", "text"])

X = pd.DataFrame(preprocess_series(books_data["text"]),columns = ["text"])
Y = pd.Series([genres.split(":")[0] for genres in books_data["genre"]])

oversampler = RandomOverSampler()
x_ros, y_ros = oversampler.fit_resample(X, Y)

column_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1,3)), "text")
)
ovr_svc_clf = multiclass.OneVsRestClassifier(svm.LinearSVC())

pipe = pipeline.make_pipeline(column_trans, ovr_svc_clf)

print(cross_val_score(
    pipe,
    X,
    Y,
    cv=3, 
    scoring="f1_macro"
).mean())

print(cross_val_score(
    pipe,
    x_ros,
    y_ros,
    cv=3, 
    scoring="f1_macro"
).mean())

Here is the distribution of my class labels. Is it small and imbalanced enough to cause overfitting?

Answer 1

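Oversampling doesn't cause overfitting.

Oversampling before splitting for cross-validation causes data leakage, and the scores you see indeed cannot be used as estimates of future performance. Your test folds (probably) contain copies of the same data points that are in the training folds.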
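To make the leakage concrete, here is a minimal sketch in plain Python. The toy dataset, the hand-rolled `random_oversample`, and the simple fold split in `leaked_test_rows` are all hypothetical stand-ins for illustration, not the imblearn or scikit-learn implementations. It counts, per fold, how many test rows are copies of a row that also sits in the training part.

```python
import random

def random_oversample(rows, rng):
    """Duplicate minority-class rows at random until every class matches the majority count."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    resampled = list(rows)
    for group in by_label.values():
        resampled.extend(rng.choice(group) for _ in range(target - len(group)))
    return resampled

def leaked_test_rows(rows, n_folds, rng):
    """Shuffle, split into folds, and count test rows whose id also appears in the training part."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    leaks = []
    for i, test in enumerate(folds):
        train_ids = {row[0] for j, fold in enumerate(folds) if j != i for row in fold}
        leaks.append(sum(row[0] in train_ids for row in test))
    return leaks

rng = random.Random(0)
# Toy data: (document id, label); 40 majority rows and 5 minority rows.
rows = [(i, "science fiction") for i in range(40)] + [(40 + i, "fantasy") for i in range(5)]

# Split only: every id is unique, so nothing can leak.
print(leaked_test_rows(rows, 3, rng))  # prints [0, 0, 0]
# Oversample first, then split: copies of the same document land on both sides.
print(leaked_test_rows(random_oversample(rows, rng), 3, rng))
```

In the second call, the minority class consists of 40 rows drawn from only 5 distinct documents, so a classifier can score well on those test rows simply by having memorized their copies during training.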

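You can add the oversampling as the first step in a pipeline (and use the imblearn version of the pipeline, if you aren't already) to alleviate this problem.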

All that said, try modeling without balancing, using a custom decision threshold or threshold-independent metrics.
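As a rough illustration of both ideas, the sketch below fits a `LinearSVC` on a synthetic imbalanced binary problem without any resampling. It is a simplified assumption-laden example: the data comes from `make_classification`, the problem is binary rather than the five-genre task in the question, and average precision and an F1-maximizing cut-off are just one possible choice of metric and threshold rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic imbalanced binary problem (roughly 10% positives), no resampling.
X, y = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=0
)

# Out-of-fold decision scores: each row is scored by a model that never saw it.
scores = cross_val_predict(LinearSVC(), X, y, cv=3, method="decision_function")

# Threshold-independent metric: average precision evaluates the whole ranking.
print("average precision:", average_precision_score(y, scores))

# Custom decision threshold: try every observed score as the cut-off and keep
# the one with the best F1, instead of LinearSVC's default cut-off of 0.
thresholds = np.unique(scores)
f1_by_threshold = [f1_score(y, (scores >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1_by_threshold))]
print("best threshold:", best, "F1:", max(f1_by_threshold))
print("default threshold F1:", f1_score(y, (scores >= 0).astype(int)))
```

Choosing the threshold on the same scores used to report F1 is optimistic; in practice the threshold would be tuned on a validation split and the final score reported on held-out data.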

python

scikit-learn

multilabel-classification

overfitting-underfitting