How do I track loss at epoch using mlflow/tensorflow?
I want to use mlflow to track the development of a TensorFlow model. How do I record the loss at each epoch? I wrote the following code:
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment("/deep_learning")

with mlflow.start_run():
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("Optimizer", opt)
    mlflow.log_metric("train_loss", train_loss)
    mlflow.log_metric("val_loss", val_loss)
    mlflow.log_metric("test_loss", test_loss)
    mlflow.log_metric("test_mse", test_mse)
    mlflow.log_artifacts("./model")
If I change train_loss and val_loss to
train_loss = history.history['loss']
val_loss = history.history['val_loss']
I get the following error:
mlflow.exceptions.MlflowException: Got invalid value [12.041399002075195] for metric 'train_loss' (timestamp=1649783654667). Please specify value as a valid double (64-bit floating point)
How can I save loss and val_loss over all epochs so that I can visualize the learning curves in mlflow?
As you can see here, you can use mlflow.tensorflow.autolog(), which (from the documentation):
Enables (or disables) and configures autologging from Keras to MLflow. Autologging captures the following information:
fit() or fit_generator() parameters; optimizer name; learning rate; epsilon
...
For example:
# !pip install mlflow
import tensorflow as tf
import mlflow
import numpy as np

# Dummy data: 100 samples with 100 features, labels from 10 classes
X_train = np.random.rand(100, 100)
y_train = np.random.randint(0, 10, 100)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(100,)))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(rate=.4))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer='Adam',
              metrics=['accuracy'])
model.summary()

# Enable autologging before calling fit() so per-epoch metrics are captured
mlflow.tensorflow.autolog()
history = model.fit(X_train, y_train, epochs=100, batch_size=50)
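If you prefer to log the per-epoch values yourself instead of (or in addition to) autolog, log each element of the history list as a separate step; log_metric only accepts a single float, which is why passing the whole list raised the error above. A minimal sketch, assuming history came from a model.fit call with validation data (e.g. validation_split=0.2) so that both 'loss' and 'val_loss' are present:

# Sketch: manually log one value per epoch from the Keras History object
with mlflow.start_run():
    for epoch, (loss, val_loss) in enumerate(zip(history.history['loss'],
                                                 history.history['val_loss'])):
        # `step` records the epoch index so the MLflow UI can draw the learning curve
        mlflow.log_metric("train_loss", loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)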
Or, as you mentioned in the comments, you can use mlflow.set_tracking_uri(), like this:
mlflow.set_tracking_uri('http://127.0.0.1:5000')
tracking_uri = mlflow.get_tracking_uri()

with mlflow.start_run(run_name='PARENT_RUN') as parent_run:
    batch_size = 50
    history = model.fit(X_train, y_train, epochs=2, batch_size=batch_size)
    mlflow.log_param("batch_size", batch_size)
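Note that if mlflow.tensorflow.autolog() was enabled before model.fit, the per-epoch training metrics are logged to this same run on the configured tracking server, so no extra log_metric calls are needed for the learning curve.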
To get the results:
!mlflow ui
Output:
[....] [...] [INFO] Starting gunicorn 20.1.0
[....] [...] [INFO] Listening at: http://127.0.0.1:5000 (****)
[....] [...] [INFO] Using worker: sync
[....] [...] [INFO] Booting worker with pid: ****
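Opening http://127.0.0.1:5000 in a browser then shows the run with its logged parameters and per-epoch metrics, which you can plot as learning curves in the MLflow UI.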