How do I track loss at each epoch using mlflow/tensorflow?

I want to use mlflow to track the development of a TensorFlow model. How do I log the loss at every epoch? I wrote the following code:

mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("/deep_learning")
with mlflow.start_run():
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("Optimizer", opt)
    mlflow.log_metric("train_loss", train_loss)
    mlflow.log_metric("val_loss", val_loss)
    mlflow.log_metric("test_loss", test_loss)
    mlflow.log_metric("test_mse", test_mse)
    mlflow.log_artifacts("./model")

If I change train_loss and val_loss to

train_loss = history.history['loss']
val_loss = history.history['val_loss']

I get the following error:

mlflow.exceptions.MlflowException: Got invalid value [12.041399002075195] for metric 'train_loss' (timestamp=1649783654667). Please specify value as a valid double (64-bit floating point)

How can I save loss and val_loss across all epochs so that I can visualize the learning curves in mlflow?

The error occurs because history.history['loss'] is a list with one value per epoch, while mlflow.log_metric() expects a single float. As you can see here, you can use mlflow.tensorflow.autolog(), which (from the docs):

Enables (or disables) and configures autologging from Keras to MLflow. Autologging captures the following information:

fit() or fit_generator() parameters; optimizer name; learning rate; epsilon ...

For example:

# !pip install mlflow
import tensorflow as tf
import mlflow
import numpy as np


# Dummy training data: 100 samples, 100 features, integer labels in [0, 10)
X_train = np.random.rand(100, 100)
y_train = np.random.randint(0, 10, 100)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(100,)))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
# softmax output so the model emits probabilities, matching
# SparseCategoricalCrossentropy(from_logits=False)
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer='Adam',
              metrics=['accuracy'])
model.summary()


# Enable autologging *before* calling fit(); MLflow then records the fit()
# parameters and the per-epoch training metrics (loss, accuracy) automatically.
mlflow.tensorflow.autolog()
history = model.fit(X_train, y_train, epochs=100, batch_size=50)
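
With autologging enabled, the loss is stored as a metric with one value per epoch, so the learning curve shows up directly in the MLflow UI. If you also want to read those values back in code, here is a minimal sketch (assuming a recent MLflow version that provides mlflow.last_active_run(), and that the metric key matches Keras's "loss" name):

from mlflow.tracking import MlflowClient

# Run that autolog created (or used) during model.fit()
run = mlflow.last_active_run()

client = MlflowClient()
# One entry per epoch; m.step is the epoch index, m.value the logged loss
for m in client.get_metric_history(run.info.run_id, "loss"):
    print(m.step, m.value)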

Or, as you mentioned in the comments, you can use mlflow.set_tracking_uri(), like this:

mlflow.set_tracking_uri('http://127.0.0.1:5000')
tracking_uri = mlflow.get_tracking_uri()
with mlflow.start_run(run_name='PARENT_RUN') as parent_run:
    batch_size=50
    history = model.fit(X_train, y_train, epochs=2, batch_size=batch_size)
    mlflow.log_param("batch_size", batch_size)  

To view the results:

!mlflow ui

Output:

[....] [...] [INFO] Starting gunicorn 20.1.0
[....] [...] [INFO] Listening at: http://127.0.0.1:5000 (****)
[....] [...] [INFO] Using worker: sync
[....] [...] [INFO] Booting worker with pid: ****