Make my python scikit function in a python-rq queue run faster?

I currently have a utilities.py file with this machine learning function:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import models
import random

words = [w.strip() for w in open('words.txt') if w == w.lower()]
def scramble(s):
    return "".join(random.sample(s, len(s)))

@models.db_session
def check_pronounceability(word):

    scrambled = [scramble(w) for w in words]

    X = words+scrambled
    y = ['word']*len(words) + ['unpronounceable']*len(scrambled)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    text_clf = Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', MultinomialNB())
        ])
    text_clf = text_clf.fit(X_train, y_train)
    stuff = text_clf.predict_proba([word])
    pronounceability = round(100*stuff[0][1], 2)
    models.Word(word=word, pronounceability=pronounceability)
    models.commit()
    return pronounceability

Then I call it in app.py:

from flask import Flask, render_template, jsonify, request
from rq import Queue
from rq.job import Job
from worker import conn
from flask_cors import CORS
from utilities import check_pronounceability

app = Flask(__name__)

q = Queue(connection=conn)

import models
@app.route('/api/word', methods=['POST', 'GET'])
@models.db_session
def check():
    if request.method == "POST":
        word = request.form['word']
        if not word:
            return render_template('index.html')
        db_word = models.Word.get(word=word)
        if not db_word:
            job = q.enqueue_call(check_pronounceability, args=(word,))
        return jsonify(job=job.id)

After reading the python-rq performance notes, which state:

A pattern you can use to improve the throughput performance for these kind of jobs can be to import the necessary modules before the fork.

I then made my worker.py file look like this:

import os

import redis
from rq import Worker, Queue, Connection

listen = ['default']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)
import utilities

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()

The problem I am having is that this still runs slowly. Am I doing something wrong? Can I make it run faster by storing everything in memory when I check a word? According to the xpost I did in the python-rq, it looks like I am importing it correctly.

I have a few suggestions:

  1. Before you start optimising python-rq throughput, check where the bottleneck is. I would be surprised if the queue were the bottleneck rather than the check_pronounceability function (see the timing sketch after this list).

  2. Make sure each call to check_pronounceability runs as fast as possible, and forget about the queue; it is not relevant at this stage.
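
As a quick way to check where the time goes, here is a minimal timing sketch. It assumes the utilities.py and worker.py shown above; the test word 'hello' is just an example:

import time

from rq import Queue
from utilities import check_pronounceability
from worker import conn

# time a direct call to the function, bypassing the queue entirely
start = time.perf_counter()
check_pronounceability('hello')
print('direct call took %.2fs' % (time.perf_counter() - start))

# time the same call through the queue, polling until the worker is done
q = Queue(connection=conn)
start = time.perf_counter()
job = q.enqueue(check_pronounceability, 'hello')
while not (job.is_finished or job.is_failed):
    time.sleep(0.1)
print('queued call took %.2fs' % (time.perf_counter() - start))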

To optimise check_pronounceability, I suggest you:

  1. Create the training data once, for all API calls
  2. Forget about train_test_split; you are not using the test split, so why waste CPU cycles creating it?

  3. Train NaiveBayes once, for all API calls. The input to check_pronounceability is a single word that needs to be classified as pronounceable or unpronounceable; there is no need to build a new model for every new word. Just create one model and reuse it for all words. This also has the benefit of producing stable results, and it makes it easier to swap out the model.

Suggested modifications below:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
import models
import random

words = [w.strip() for w in open('words.txt') if w == w.lower()]
def scramble(s):
    return "".join(random.sample(s, len(s)))

scrambled = [scramble(w) for w in words]
X = words+scrambled
# explicitly create binary labels (ravel to the 1-D shape scikit-learn expects)
label_binarizer = LabelBinarizer()
y = label_binarizer.fit_transform(['word']*len(words) + ['unpronounceable']*len(scrambled)).ravel()

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
])
text_clf = text_clf.fit(X, y)
# you might want to persist the Pipeline to disk at this point to ensure it's not lost in case there is a crash    

@models.db_session
def check_pronounceability(word):
    stuff = text_clf.predict_proba([word])
    pronounceability = round(100*stuff[0][1], 2)
    models.Word(word=word, pronounceability=pronounceability)
    models.commit()
    return pronounceability
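
On the comment above about persisting the Pipeline: a minimal sketch using joblib (the model.joblib filename is just an example; load it wherever the worker process starts so a crash does not force a retrain):

from joblib import dump, load

# after fitting, write the trained pipeline to disk
dump(text_clf, 'model.joblib')

# on start-up, load the saved pipeline instead of retraining it
text_clf = load('model.joblib')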

Final notes:

  • I am assuming you have already done some cross-validation of the model elsewhere to verify that it actually does a good job of predicting label probabilities; if you have not, you should.

  • NaiveBayes is usually not the best at producing reliable class probability estimates: it tends to be either over-confident or too timid (probabilities close to 1 or 0). You should check for that in your database. A LogisticRegression classifier should give you more reliable probability estimates, and now that model training is no longer part of the API call, it does not matter how long training the model takes (see the comparison sketch below).
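
If you have not validated the probabilities yet, here is a rough sketch comparing the two classifiers with cross-validated log loss, reusing the X and y built in the snippet above (5 folds and neg_log_loss are just example choices):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for name, clf in [('MultinomialNB', MultinomialNB()),
                  ('LogisticRegression', LogisticRegression())]:
    candidate = Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', clf)
    ])
    # neg_log_loss penalises over-confident wrong probabilities; closer to 0 is better
    scores = cross_val_score(candidate, X, y, scoring='neg_log_loss', cv=5)
    print(name, scores.mean())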