How to update the weights of a pickled file?

I train a calibrated classifier on Google Cloud Scheduler every day, and the run takes about 5 minutes. My Python script pulls the latest data (from that day), concatenates it with the original data, retrains the model, and saves the pickled files to Cloud Storage. The problem I'm facing is that if the run takes longer than 5 minutes (and at some point it will), I get an upstream request timeout error.

Since this happens because the model needs more time to train, one solution I can think of is to train the model on the new data only and update the weights of the original model inside the pickled file. I'm not sure whether that's feasible, though.
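For reference, scikit-learn exposes exactly this kind of incremental training through the partial_fit API, but LinearSVC does not implement it; the closest drop-in is SGDClassifier with hinge loss, which is also a linear SVM. Below is a minimal sketch of what the daily update could look like under that substitution, assuming the vectorizers and label encoding stay frozen (update_model, the local paths, and new_texts/new_labels/all_classes are all hypothetical):

import pickle
from sklearn.linear_model import SGDClassifier

def update_model(new_texts, new_labels, all_classes):
    # Load the previously trained artifacts (placeholder local paths).
    with open('count_vectorizer.sav', 'rb') as f:
        count_vect = pickle.load(f)
    with open('tfidf_vectorizer.sav', 'rb') as f:
        tf_transformer = pickle.load(f)
    try:
        with open('svc.sav', 'rb') as f:
            clf = pickle.load(f)  # the previously saved SGDClassifier
    except FileNotFoundError:
        clf = SGDClassifier(loss='hinge')  # a linear SVM trained with SGD

    # Vectorize only the new rows with the existing, frozen vocabulary;
    # tokens never seen during the original fit are silently dropped.
    X_new = tf_transformer.transform(count_vect.transform(new_texts))

    # One incremental pass over the new batch. new_labels must already be
    # encoded with the original LabelEncoder, and all_classes must list every
    # encoded label, since partial_fit requires it on the first call.
    clf.partial_fit(X_new, new_labels, classes=all_classes)

    with open('svc.sav', 'wb') as f:
        pickle.dump(clf, f)

The two caveats that may rule this out here: CountVectorizer/TfidfTransformer cannot learn new vocabulary incrementally (HashingVectorizer avoids that), and CalibratedClassifierCV has no partial_fit, so the calibration step would still need a periodic full retrain.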

Below is the function I run on the scheduler:

import pickle

import gcsfs
import pandas as pd
from google.cloud import storage
from googletrans import Translator  # assuming the googletrans package
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC


def train_model():
    users, tasks, tags, task_tags, task_user, boards = connect_postgres()  # helper that loads the raw tables from Postgres
    storage_client = storage.Client()
    bucket = storage_client.get_bucket('my-bucket')
    blob = bucket.blob('original_data.pkl')
    pickle_in0 = blob.download_as_string()
    data = pickle.loads(pickle_in0)

    tasks = tasks.rename(columns={'id': 'task_id', 'name': 'task_name'})

    # Joining tasks and task_user_assigns tables
    tasks = tasks[tasks.task_name.notnull()]
    task_user = task_user[['id', 'task_id', 'user_id']].rename(columns={'id': 'task_user_id'})
    task_data = tasks.merge(task_user, on='task_id', how='left')

    # Joining users with the task_data
    users = users[['id', 'email']].rename(columns={'id': 'user_id'})
    users_tasks = task_data.merge(users, on='user_id', how='left')
    users_tasks = users_tasks[users_tasks.user_id.notnull()].reset_index(drop=True)

    # Joining boards table to user_tasks
    boards = boards[['id', 'name']].rename(columns={'id': 'board_id', 'name': 'board_name'})
    users_board = users_tasks.merge(boards, on='board_id', how='left').reset_index(drop=True)

    # Data Cleaning
    translator = Translator()  # This is to translate if the tasks are not in English
    users_board["task_trans"] = users_board["task_name"].map(lambda x: translator.translate(x, dest="en").text)

    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_emoji(x))  #This calls a function to remove Emoticons from text
    users_board['task_trans'] = users_board['task_trans'].apply(lambda x: remove_punct(x))  #This calls a function to remove punctuations from text

    users_board = users_board[['task_id', 'email', 'board_id', 'user_id', 'task_trans']]

    data1 = pd.concat([data, users_board], axis=0)

    df1 = data1.copy()

    X = df1.task_trans  # all the observations
    y = df1.user_id  # all the labels

    print(y.nunique())

    #FROM HERE ON, THE TRAINING SCRIPT BEGINS

    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)
    tf_transformer = TfidfTransformer().fit(X_train_counts)
    X_train_transformed = tf_transformer.transform(X_train_counts)

    print('model 1 done')

    labels = LabelEncoder()
    y_train_labels_trf = labels.fit_transform(y)

    linear_svc = LinearSVC()
    clf = linear_svc.fit(X_train_transformed, y_train_labels_trf)

    print('model 2 done')

    # Wrap the already-fitted SVM to get calibrated probability estimates
    calibrated_svc = CalibratedClassifierCV(base_estimator=linear_svc, cv="prefit")
    calibrated_svc.fit(X_train_transformed, y_train_labels_trf)

    print('model 3 done')

    # SAVING THE MODELS ON GOOGLE CLOUD STORAGE

    fs = gcsfs.GCSFileSystem(project='my-project')

    filename = '~path/svc.sav'
    with fs.open(filename, 'wb') as f:  # context manager so gcsfs flushes and closes the upload
        pickle.dump(calibrated_svc, f)

    filename = '~path/count_vectorizer.sav'
    with fs.open(filename, 'wb') as f:
        pickle.dump(count_vect, f)

    filename = '~path/tfidf_vectorizer.sav'
    with fs.open(filename, 'wb') as f:
        pickle.dump(tf_transformer, f)

    blob = bucket.blob('data.pkl')
    pickle_out = pickle.dumps(df1)
    blob.upload_from_string(pickle_out)

    return "success"

Any idea how to implement this? Or any other strategy I could follow to get around this problem?

I never found a way to update the weights inside the pickled file. In the end I worked around it by raising the timeout parameter on Cloud Run above the training time, which solves the problem for now.
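In case it helps anyone hitting the same wall: assuming the job is served from Cloud Run and triggered by Cloud Scheduler over HTTP (the service and job names below are placeholders), both limits can be raised from the CLI, e.g. gcloud run services update my-service --timeout=900 and gcloud scheduler jobs update http my-job --attempt-deadline=900s. Cloud Run's request timeout defaults to 300 seconds, which lines up with the 5-minute cutoff described above.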