Tika server fails to start in Airflow deployed on Kubernetes (from the fourth simultaneous run)
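I would like to ask whether anyone has run into a similar error.
I work at a company where we use Airflow, deployed on Azure Kubernetes.
We have a DAG that extracts information from different documents. Among the many things we extract, we use Tika to extract the XML.
The flow is:
- We upload 10 files.
- 10 different DAGs are created to extract information from the documents.
- When extracting the XML with Tika, some of the DAGs start to fail because the Tika server cannot initialize itself.
Some facts about the task that uses tika-server:
- We have set the number of retries to 3.
- We have limited the concurrent executions of this task to 3, so that it never fails.
This is our task in Airflow: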
text_extraction = KubernetesPodOperator(
    task_id="text_extraction",
    name="text_extraction",
    namespace=DEFAULT_NAMESPACE,
    image_pull_secrets=[k8s.V1LocalObjectReference('acr-pull')],
    image=image_text_tools,
    arguments=[
        "tika-text-extract",
        "--input-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.input_file_name}",
        "--xml-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.xml_file_name}",
        "--metadata-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.metadata_file_name}",
        "--ocr"
    ],
    get_logs=True,
    is_delete_operator_pod=True,
    startup_timeout_seconds=300,
    volumes=[VOLUME.volume],
    volume_mounts=[VOLUME.volume_mount1],
    max_active_tis_per_dag=3,
    retries=3,
    retry_delay=timedelta(minutes=1),
)
I am leaving the error here, although I don't think it helps much:
[2022-03-02, 09:27:33 UTC] {pod_manager.py:203} INFO - [cli.py: - parse_document() ] Extracting text with OCR enabled from: /opt/airflow/data/61d45f641b57d80819f9448f/6218edbbe40ccbfe96c6bdcd/20220225-145515_file/file
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:34 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:39 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:44 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread ] [ERROR] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - checkTikaServer() ] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 128, in <module>
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - app()
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return get_command(self)(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return self.main(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - rv = self.invoke(ctx)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return _process_result(sub_ctx.command.invoke(sub_ctx))
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return ctx.invoke(self.callback, **ctx.params)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return __callback(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return callback(**use_params) # type: ignore
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 99, in tika_text_extract
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - parse_document(
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 28, in parse_document
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - parsed_pdf = parser.from_file(ip, xmlContent=True, requestOptions={"headers": headers, "timeout": timeout})
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/parser.py", line 42, in from_file
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - output = parse1(service, filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - status, response = callServer('put', serverEndpoint, service, f,
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - raise RuntimeError("Unable to start Tika server.")
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - RuntimeError: Unable to start Tika server.
I solved it by simply changing TIKA_STARTUP_MAX_RETRY to 5, because the Tika server takes longer to start when many executions run at the same time.
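For anyone hitting the same thing, here is a minimal sketch of how the setting could be applied inside the container. TIKA_STARTUP_MAX_RETRY and TIKA_STARTUP_SLEEP are environment variables that tika-python reads, as far as I know, when the module is imported, so they have to be set before `from tika import parser`. The helper names `tika_startup_env` and `startup_budget_seconds` are made up for this illustration, and the 5-second sleep is an assumption taken from the spacing of the retries in the log above.

```python
import os


def tika_startup_env(max_retry: int = 5, sleep_seconds: float = 5.0) -> dict:
    """Build the environment variables that control tika-python's startup checks.

    Hypothetical helper: only the two variable names come from tika-python.
    """
    return {
        "TIKA_STARTUP_MAX_RETRY": str(max_retry),
        "TIKA_STARTUP_SLEEP": str(sleep_seconds),
    }


def startup_budget_seconds(env: dict) -> float:
    """Rough upper bound on how long the client waits for the server to start."""
    return int(env["TIKA_STARTUP_MAX_RETRY"]) * float(env["TIKA_STARTUP_SLEEP"])


env = tika_startup_env(max_retry=5)

# Must run before `from tika import parser`, otherwise the defaults are used.
os.environ.update(env)

print(startup_budget_seconds(env))
```

With 3 checks at 5 seconds each the client gives up after roughly 15 seconds, which matches the log (09:27:34 to 09:27:49); 5 checks raise that to roughly 25 seconds. The same values can probably also be passed from the DAG through the `env_vars` argument of KubernetesPodOperator instead of being baked into the image, but I have not tested that variant here.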