Tika server fails to start in Airflow deployed on Kubernetes (from the fourth simultaneous run onward)

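I would like to ask whether anyone has run into a similar error.

I work at a company where we use Airflow, deployed on Azure Kubernetes.

We have a DAG that extracts information from various documents. Among the many things we extract from each document, we use Tika to extract the XML.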

The flow is:

Some facts about the task that uses tika-server:

This is our task in Airflow:

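from datetime import timedelta

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# DEFAULT_NAMESPACE, image_text_tools, xcom_pull_folder, BASIC_CONFIG_FACTORY
# and VOLUME are defined elsewhere in our DAG file.
text_extraction = KubernetesPodOperator(
    task_id="text_extraction",
    name="text_extraction",
    namespace=DEFAULT_NAMESPACE,
    image_pull_secrets=[k8s.V1LocalObjectReference('acr-pull')],
    image=image_text_tools,
    # CLI command run inside the pod; it calls tika-python to parse the file.
    arguments=[
        "tika-text-extract",
        "--input-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.input_file_name}",
        "--xml-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.xml_file_name}",
        "--metadata-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.metadata_file_name}",
        "--ocr"
    ],
    get_logs=True,
    is_delete_operator_pod=True,
    startup_timeout_seconds=300,
    volumes=[VOLUME.volume],
    volume_mounts=[VOLUME.volume_mount1],
    # At most three instances of this task run at once; retry up to three times.
    max_active_tis_per_dag=3,
    retries=3,
    retry_delay=timedelta(minutes=1),
)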

I am leaving the error here, although I do not think it helps much:

[2022-03-02, 09:27:33 UTC] {pod_manager.py:203} INFO - [cli.py: - parse_document() ] Extracting text with OCR enabled from: /opt/airflow/data/61d45f641b57d80819f9448f/6218edbbe40ccbfe96c6bdcd/20220225-145515_file/file
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:34 UTC [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:39 UTC [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:44 UTC [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread  ] [ERROR]  Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - checkTikaServer() ] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/text-tools/cli.py", line 128, in <module>
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     app()
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return get_command(self)(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return self.main(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     rv = self.invoke(ctx)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return ctx.invoke(self.callback, **ctx.params)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return __callback(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return callback(**use_params)  # type: ignore
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/text-tools/cli.py", line 99, in tika_text_extract
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     parse_document(
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/text-tools/cli.py", line 28, in parse_document
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     parsed_pdf = parser.from_file(ip, xmlContent=True, requestOptions={"headers": headers, "timeout": timeout})
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/parser.py", line 42, in from_file
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     output = parse1(service, filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     status, response = callServer('put', serverEndpoint, service, f,
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     raise RuntimeError("Unable to start Tika server.")
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - RuntimeError: Unable to start Tika server.

I solved it by simply changing TIKA_STARTUP_MAX_RETRY to 5, because the Tika server takes longer to start when many runs execute at the same time.
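
For anyone wanting to reproduce that fix, here is a minimal sketch. It assumes tika-python reads TIKA_STARTUP_MAX_RETRY from the environment when the `tika` module is imported, so the variable has to be set before that import. The helper names `tika_startup_env` and `startup_budget_seconds` are made up for illustration, and the 5-second retry interval is only inferred from the timestamps in the log above, not taken from the library documentation.

```python
import os


def tika_startup_env(max_retry: int = 5) -> dict:
    """Build the environment variable that raises the startup retry limit.

    Hypothetical helper: it only prepares the mapping, it does not start Tika.
    """
    if max_retry < 1:
        raise ValueError("max_retry must be at least 1")
    # Environment variable values must be strings.
    return {"TIKA_STARTUP_MAX_RETRY": str(max_retry)}


def startup_budget_seconds(max_retry: int, retry_interval: int = 5) -> int:
    """Rough time allowed for the server to start.

    retry_interval=5 is an assumption read off the log timestamps
    (09:27:34, 09:27:39, 09:27:44, 09:27:49).
    """
    return max_retry * retry_interval


# Apply the setting in the current process. This must happen before
# `import tika`, otherwise the library may already have read the old value.
os.environ.update(tika_startup_env(max_retry=5))

print(startup_budget_seconds(3))  # budget with the limit of 3 tries seen in the log
print(startup_budget_seconds(5))  # budget after raising the limit to 5
```

The same mapping could probably be handed to the pod instead of being set in the CLI code, for example through the `env_vars` argument of KubernetesPodOperator, but I have not verified that against our provider version, so treat it as a suggestion rather than a tested setup.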