mlflow tracking server does not start after specifying backend-store-uri

I run mlflow as follows.

The Dockerfile contains the following CMD instruction:

CMD mlflow server \
    --host 0.0.0.0 \
    --backend-store-uri "${BACKEND_STORE_URI}" \
    --default-artifact-root "${DEFAULT_ARTIFACT_ROOT}"

docker run --rm --name mlflow -p 5000:5000 -e BACKEND_STORE_URI=mssql+pymssql://user:pass@mybackendstoreuri/mlflow mlflow

After that, the following is displayed:

INFO  [alembic.runtime.migration] Context impl MSSQLImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
INFO  [alembic.runtime.migration] Context impl MSSQLImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.

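However, the container exits without starting the server.

If I do not specify the backend store URI, I can see the log lines about binding to the host, and the container does not exit.

How can I run the mlflow tracking server with a backend store URI?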

The root cause is explained by this comment in the MLflow source:

MLflow UI and client code expects a default experiment with ID 0.
This method uses SQL insert statement to create the default experiment as a hack, since
experiment table uses 'experiment_id' column is a PK and is also set to auto increment.
MySQL and other implementation do not allow value '0' for such cases.

Reference: https://github.com/mlflow/mlflow/blob/v1.2.0/mlflow/store/sqlalchemy_store.py#L171

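No error is shown because the migration itself raises none: the failure is silent, and the alembic version is reported as up to date. Reference: https://github.com/mlflow/mlflow/blob/v1.2.0/mlflow/store/db_migrations/env.py#L71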
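One way to confirm this state is to look at the database directly. The following is a minimal sketch, not part of MLflow: `inspect_store` is a hypothetical helper, and the table names `alembic_version` and `experiments` and the column `experiment_id` are assumptions about the schema. The demo uses an in-memory SQLite database so it runs anywhere; against MSSQL you would pass in a connection from your own driver instead.

```python
import sqlite3


def inspect_store(conn):
    """Return (alembic_revision, has_default_experiment) for a DB-API connection.

    Assumes the schema has an `alembic_version` table (alembic's default)
    and an `experiments` table keyed by `experiment_id`.
    """
    cur = conn.cursor()
    cur.execute("SELECT version_num FROM alembic_version")
    row = cur.fetchone()
    revision = row[0] if row else None

    # The default experiment is the row with experiment_id 0.
    cur.execute("SELECT COUNT(*) FROM experiments WHERE experiment_id = 0")
    has_default = cur.fetchone()[0] > 0
    return revision, has_default


def demo():
    """Simulate a migrated database whose default experiment was never inserted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE alembic_version (version_num VARCHAR(32) NOT NULL)")
    conn.execute("INSERT INTO alembic_version VALUES ('7ac759974ad8')")
    conn.execute(
        "CREATE TABLE experiments (experiment_id INTEGER PRIMARY KEY, name TEXT)"
    )
    return inspect_store(conn)


if __name__ == "__main__":
    print(demo())
```

A revision at the head of the migration chain combined with a missing experiment 0 would match the symptom described here: the migration looks complete, yet the server has no default experiment.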

If the same idea as in the MySQL test is used (https://github.com/mlflow/mlflow/blob/v1.2.0/mlflow/store/sqlalchemy_store.py#L171), an exception is raised: Cannot insert explicit value for identity column in table 'experiment' when IDENTITY_INSERT is set to OFF.
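SQL Server only accepts an explicit value for an identity column while IDENTITY_INSERT is ON for that table. The sketch below builds the T-SQL statements that a manual workaround might run. It is not how MLflow handles this: `default_experiment_statements` is a hypothetical helper, and the table name and the reduced column list (`experiment_id`, `name`) are assumptions, so the INSERT may need further columns to match the actual schema.

```python
def default_experiment_statements(table="experiments", name="Default"):
    """Build T-SQL statements that insert an experiment with ID 0.

    The table and column names are assumptions for illustration only.
    """
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    safe_name = name.replace("'", "''")  # escape single quotes for a SQL literal
    return [
        "SET IDENTITY_INSERT {} ON;".format(table),
        "INSERT INTO {} (experiment_id, name) VALUES (0, '{}');".format(
            table, safe_name
        ),
        "SET IDENTITY_INSERT {} OFF;".format(table),
    ]


if __name__ == "__main__":
    for statement in default_experiment_statements():
        print(statement)
```

The statements have to run on the same connection, since IDENTITY_INSERT is a session-level setting.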

Test snippet:

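import unittest

import sqlalchemy
from mlflow.store.sqlalchemy_store import SqlAlchemyStore

ARTIFACT_URI = "artifact_folder"  # placeholder artifact root; not given in the original snippet

class TestSqlAlchemyStoreMssqlDb(unittest.TestCase):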
    """
    Run tests against a MSSQL database
    """
    def setUp(self):
        db_username = "test"
        db_password = "test"
        host = "test"
        db_name = "TEST_DB"

        db_server_url = "mssql+pymssql://%s:%s@%s" % (db_username, db_password, host)
        self._engine = sqlalchemy.create_engine(db_server_url)

        self._db_url = "%s/%s" % (db_server_url, db_name)
        print("Connect to %s" % self._db_url)

    def test_store(self):
        self.store = SqlAlchemyStore(db_uri=self._db_url, default_artifact_root=ARTIFACT_URI)

As the logs below show, the migration completes when a postgres server is used.

mlflow_1    | 2019/09/24 09:03:55 INFO mlflow.store.sqlalchemy_store: Creating initial MLflow database tables...
mlflow_1    | 2019/09/24 09:03:55 INFO mlflow.store.db.utils: Updating database tables at postgresql://postgres:postgres@postgres:5432/postgres
mlflow_1    | INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
mlflow_1    | INFO  [alembic.runtime.migration] Will assume transactional DDL.
mlflow_1    | INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
mlflow_1    | INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
mlflow_1    | INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
mlflow_1    | INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
mlflow_1    | INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
mlflow_1    | INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
mlflow_1    | INFO  [alembic.runtime.migration] Will assume transactional DDL.
mlflow_1    | [2019-09-24 09:03:55 +0000] [15] [INFO] Starting gunicorn 19.9.0
mlflow_1    | [2019-09-24 09:03:55 +0000] [15] [INFO] Listening at: http://0.0.0.0:5000 (15)
mlflow_1    | [2019-09-24 09:03:55 +0000] [15] [INFO] Using worker: sync
mlflow_1    | [2019-09-24 09:03:55 +0000] [18] [INFO] Booting worker with pid: 18
mlflow_1    | [2019-09-24 09:03:56 +0000] [22] [INFO] Booting worker with pid: 22
mlflow_1    | [2019-09-24 09:03:56 +0000] [26] [INFO] Booting worker with pid: 26
mlflow_1    | [2019-09-24 09:03:56 +0000] [27] [INFO] Booting worker with pid: 27