Scikit-learn out-of-core text classification memory consumption

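I am trying to classify a large number of text documents with scikit-learn. Although I am using the out-of-core functionality (SGDClassifier and HashingVectorizer), the program seems to consume a lot of RAM (>10 GB). Beforehand, I lemmatized the text data and removed stop words. I feel I am missing something important here. Can you spot a mistake in my code?

Thanks a lot for any suggestions!

Here is my Python code: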

import os
import re
from itertools import islice

import numpy as np
import pyprind
from sklearn import metrics
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

directory = "mydirectory"
batch_size = 1000
n_batches = 44
pbar = pyprind.ProgBar(n_batches)

# Removes punctuation, date-like tokens and standalone numbers.
# Raw string so that \w, \s and \d are not treated as string escapes.
CLEAN_RE = re.compile(r'[^\w\s]|(.\d{1,4}[\./]\d{1,2}[\./]\d{1,4})|(\s\d{1,})')


class Doc_Iterable:
    """Lazily yields cleaned lines so the raw test texts are never held in a list."""
    def __init__(self, file):
        self.file = file

    def __iter__(self):
        for line in self.file:
            yield CLEAN_RE.sub('', line)


def stream_docs(path, texts_file, labels_file):
    """Yield (text, label) pairs one document at a time."""
    with open(os.path.join(path, texts_file), 'r') as fX, \
            open(os.path.join(path, labels_file), 'r') as fy:
        for text, label in zip(fX, fy):
            # Strip the trailing newline so training labels match y_test below.
            yield CLEAN_RE.sub('', text), label.strip()


def get_minibatch(doc_stream, size):
    """Collect up to `size` documents; the lists are shorter if the stream ends."""
    X, y = [], []
    for text, label in islice(doc_stream, size):
        X.append(text)
        y.append(label)
    return X, y


# partial_fit needs the full set of classes up front.
classes = set()
for labels_file in ('y_train', 'y_test'):
    with open(os.path.join(directory, labels_file), 'r') as f:
        classes.update(label.strip() for label in f)
classes = sorted(classes)

h_vectorizer = HashingVectorizer(lowercase=True, ngram_range=(1, 1))
# n_iter was dropped here: recent scikit-learn releases no longer accept it,
# and partial_fit performs a single pass over each batch anyway.
clf = SGDClassifier(loss='hinge', alpha=1e-4, shuffle=True)

doc_stream = stream_docs(path=directory, texts_file='X_train', labels_file='y_train')
n_samples = 0

for iteration in range(n_batches):
    print("Training with batch nr.", iteration)

    X_train, y_train = get_minibatch(doc_stream, size=batch_size)
    if not X_train:
        break  # the stream ran out before n_batches were read

    n_samples += len(X_train)
    X_train = h_vectorizer.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

del X_train
del y_train
print("Training complete. Classifier trained with " + str(n_samples) + " samples.")
print()
print("Testing...")
print()
with open(os.path.join(directory, 'X_test'), 'r') as f:
    X_test = h_vectorizer.transform(Doc_Iterable(f))
y_test = np.genfromtxt(os.path.join(directory, 'y_test'), dtype=None, delimiter='|').astype(str)
prediction = clf.predict(X_test)
score = metrics.accuracy_score(y_test, prediction)
print("Accuracy: ", score)
print()

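This may not be an answer (sorry, I can't post a comment because of my reputation), but I have worked on an image classification project.

In my experience, training with scikit-learn is very slow (in my case I used about 30 images, and training the classifier took me roughly 2 to 6 minutes). When I switched to OpenCV-python, training the same classifier on the same amount of training data took only about a minute or less.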

Try adjusting n_features in the HashingVectorizer, for example:

h_vectorizer = HashingVectorizer(n_features=10000, lowercase=True, ngram_range=(1,1))
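To see what this parameter changes, the following minimal sketch vectorizes two made-up toy sentences with the default width and with the reduced one, then reports the matrix shape and the bytes actually held by the sparse result. The helper `sparse_nbytes` and the `results` dict are names introduced here for illustration only; the sketch assumes `transform` returns a SciPy CSR matrix, which is what HashingVectorizer produces.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, invented for this illustration.
docs = ["the quick brown fox", "jumps over the lazy dog"]


def sparse_nbytes(X):
    """Bytes held by the three arrays that back a CSR matrix."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


results = {}
for n_features in (2 ** 20, 10000):
    vec = HashingVectorizer(n_features=n_features, lowercase=True, ngram_range=(1, 1))
    X = vec.transform(docs)
    dense_bytes = X.shape[0] * X.shape[1] * 8  # size if stored densely as float64
    results[n_features] = (X.shape, sparse_nbytes(X), dense_bytes)
    print(n_features, X.shape, "sparse bytes:", sparse_nbytes(X), "dense bytes:", dense_bytes)
```

The number of columns follows n_features, while the sparse storage depends mainly on how many tokens the documents contain, not on the width.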

With the default parameters (n_features=1048576), you can expect the transformed matrix to take up at most:

1048576(features) x 1000(mini batch size) x 8 bytes = 8.4 GB

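Because of sparsity it will be smaller than that, but the classifier's coefficients, which are stored as a dense array, add up to:

1048576(features) x len(classes) x 8 bytes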

This may explain your current memory usage.
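The two estimates above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes 8-byte floats, as the formulas do, and the class count of 100 is a made-up figure, since the question does not say how many classes there are. The function names are introduced here for illustration.

```python
BYTES_PER_FLOAT64 = 8


def dense_batch_bytes(n_features, batch_size):
    """Upper bound: size of one mini-batch if it were stored densely."""
    return n_features * batch_size * BYTES_PER_FLOAT64


def coef_bytes(n_features, n_classes):
    """Size of a dense (n_classes, n_features) float64 coefficient array."""
    return n_features * n_classes * BYTES_PER_FLOAT64


def to_gb(n_bytes):
    """Convert bytes to gigabytes (10**9 bytes)."""
    return n_bytes / 1e9


batch_size = 1000
n_classes = 100  # hypothetical; replace with len(classes) from the question's code

for n_features in (2 ** 20, 10000):
    print(
        f"n_features={n_features}: "
        f"batch upper bound {to_gb(dense_batch_bytes(n_features, batch_size)):.2f} GB, "
        f"coefficients {to_gb(coef_bytes(n_features, n_classes)):.3f} GB"
    )
```

Plugging in the real number of classes shows how much of the observed footprint the coefficient array alone could account for, and how far reducing n_features shrinks it.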