Docker python tika

I want to create a Dockerfile that installs everything needed to run python-tika inside a Docker container.

Here is my Dockerfile so far:

###Get python
FROM python:3

RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas

RUN mkdir scripts

ADD runner.py /scripts/

CMD [ "python", "./scripts/runner.py" ]

I build the image and run it:

docker build -t docker-tika .

docker run docker-tika

But it fails with the following errors:

[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 13:49:52,528 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 13:50:09,742 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 13:50:10,133 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 13:50:10,134 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
2020-05-08 13:50:10,271 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 13:50:10,271 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

The runner.py script is:

import tika
tika.initVM()

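There are two points I am unsure about:

1. I read that we need to download the tika-server jar.
2. Calling initVM() in the Python script starts tika-server in the background.

I don't know what is missing from the Dockerfile. Thanks for any help!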

I have updated the Dockerfile to install Java as well, and it is still complaining about Java:

### 1. Get Linux
FROM alpine:3.7

### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre

ENV JAVA_HOME=/opt/java/openjdk \
    PATH="/opt/java/openjdk/bin:$PATH"

### 3. Get Python
FROM python:3

RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas

RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output

ADD runner2.py /scripts/
ADD sample.pdf .

CMD [ "python", "./scripts/runner2.py" ]

cat runner2.py:

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])

[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika

2020-05-08 14:40:23,183 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 14:41:00,480 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 14:41:02,324 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 14:41:02,324 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

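From tika-python's GitHub page:

To use this library, you need to have Java 7+ installed on your system, as tika-python starts up the Tika REST server in the background.

So you need Java, but the python:3 image does not include it. There are a few possible solutions: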

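  1. Find a Docker image that already has both Python and Tika installed
  2. Use separate images
  3. Install Java manually on python:3 by adding the Java installation commands to your Dockerfile
  4. Install Python on a Java image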
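
Whichever option you pick, it may help to confirm from inside the container that Java is actually reachable before tika tries to start the server. Below is a minimal sketch using only the standard library; java_available is a hypothetical helper name, not part of tika-python, and it only checks that a java executable on PATH runs, not which version it is.

```python
import shutil
import subprocess


def java_available(executable="java"):
    """Return True if `executable` is on PATH and `<executable> -version` exits with 0.

    Hypothetical helper for diagnosing "Unable to run java; is it installed?".
    """
    path = shutil.which(executable)
    if path is None:
        # Nothing named `executable` is on PATH at all.
        return False
    try:
        result = subprocess.run(
            [path, "-version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary exists but could not be started, or it hung.
        return False
    return result.returncode == 0


# Possible use at the top of runner.py, before tika.initVM():
#     if not java_available():
#         raise SystemExit("java not found on PATH; install a JRE in this image")
```

Running this in the final image (for example through docker run docker-tika python -c "...") should tell you whether the problem is the image contents or something in tika itself.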

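I don't have enough reputation to comment, so I am posting here.

It looks like your Dockerfile is now a multi-stage build, and Java is no longer in the final stage: the earlier stage, where Java was installed, is discarded.

As Giga Kokaia and others have already said, Java is the problem. It looks like you want to do this with a single Dockerfile. That can be done, for example, by using Alpine as the base image, but you will need some extra dependencies to install Python and the required pip packages. Alpine may not be the best base for Python when many libraries are involved, because it uses musl rather than glibc as its C library. Nevertheless, here is a very roughly updated Dockerfile: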
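
To make the multi-stage point concrete, here is a toy model in Python, not how Docker itself is implemented: every FROM line starts a new stage, and only the last stage becomes the image you run. The function name final_stage_instructions and the shortened EXAMPLE text are made up for illustration, and the sketch ignores line continuations and COPY --from.

```python
def final_stage_instructions(dockerfile_text):
    """Return the instructions that belong to the last FROM stage.

    Simplified model: blank lines and comments are skipped, and each
    physical line is treated as one instruction.
    """
    stage = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.upper().startswith("FROM "):
            # A new FROM begins a fresh stage; earlier stages are not
            # part of the final image unless copied from explicitly.
            stage = []
        stage.append(line)
    return stage


# Shortened version of the Dockerfile from the question.
EXAMPLE = """\
FROM alpine:3.7
RUN apk add --no-cache openjdk8-jre
FROM python:3
RUN pip3 install tika
"""

final = final_stage_instructions(EXAMPLE)
print(final)  # ['FROM python:3', 'RUN pip3 install tika']
```

The Java install line does not appear in the result, which matches the "Unable to run java" error you are still seeing.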

### 1. Get Linux
FROM alpine:3.7

### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre \
&& apk add python3 python3-dev gcc g++ gfortran musl-dev libxml2-dev libxslt-dev

ENV JAVA_HOME=/opt/java/openjdk \
    PATH="/opt/java/openjdk/bin:$PATH"


RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx wheel tika numpy 
RUN pip3 install pandas

RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output

ADD runner2.py /scripts/
ADD sample.pdf .

CMD [ "python3", "./scripts/runner2.py"  ]

I am reposting @anapaulagomes's comment as an answer, because it is what I was searching Google for: running Tika as a Docker container.

I managed to solve this by using Tika as a separate service (which had better performance than having it in the same image). But instead of running Tika's jar, I consume its API. You only need to configure the environment variables TIKA_CLIENT_ONLY: 1 and TIKA_SERVER_ENDPOINT: tika:9998. You can see the Dockerfile and docker-compose.yml here: https://github.com/DadosAbertosDeFeira/maria-quiteria

You can start the Tika service with:

docker run --rm -t -d --name my_tika --net my-network \
         -p 9998:9998 apache/tika:1.27

Or add this to your docker-compose.yml:

tika:
    image: apache/tika
    ports:
        - "9998:9998"

This lets you call from tika import parser and parse files without calling tika.initVM().
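
On the Python side, a rough sketch of the client-only setup could look like the following. The environment variable names come from the comment quoted above; the helper names configure_tika_client and parse_file are hypothetical, and the http:// prefix on the endpoint is my assumption (the quoted comment writes just tika:9998). As far as I can tell, tika-python reads these variables when it is first imported, so the sketch sets them before importing tika.

```python
import os


def configure_tika_client(endpoint="http://tika:9998"):
    """Point tika-python at an already running Tika server.

    Hypothetical helper: must be called before the first `import tika`.
    The default endpoint assumes a compose service named `tika`.
    """
    os.environ["TIKA_CLIENT_ONLY"] = "1"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    return endpoint


def parse_file(path):
    """Parse `path` through the remote Tika server and return the result dict."""
    # Imported lazily so the environment variables above are already set.
    from tika import parser

    return parser.from_file(path)


# Possible use in runner2.py:
#     configure_tika_client()
#     parsed = parse_file("sample.pdf")
#     print(parsed["metadata"])
#     print(parsed["content"])
```

You could equally set the two variables with ENV in the Dockerfile or under environment: in docker-compose.yml, in which case the Python script needs no changes at all.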