Docker python tika
I want to create a Dockerfile that installs all the components needed to run python-tika in a Docker container.
This is my Dockerfile so far:
###Get python
FROM python:3
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas
RUN mkdir scripts
ADD runner.py /scripts/
CMD [ "python", "./scripts/runner.py" ]
I build the image and run it:
docker build -t docker-tika .
docker run docker-tika
But it fails with the following error:
[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 13:49:52,528 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 13:50:09,742 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 13:50:10,133 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 13:50:10,134 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
2020-05-08 13:50:10,271 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 13:50:10,271 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
The runner.py script is as follows:
import tika
tika.initVM()
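There are two points I am unsure about:
1. I read that we need to download the tika-server jar.
2. Calling initVM() in the Python script starts tika-server in the background.
I don't know what is missing from the Dockerfile. Any help is appreciated!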
I have updated the Dockerfile to install Java as well, and it still complains about Java:
### 1. Get Linux
FROM alpine:3.7
### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre
ENV JAVA_HOME=/opt/java/openjdk \
PATH="/opt/java/openjdk/bin:$PATH"
### 3. Get Python
FROM python:3
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas
RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output
ADD runner2.py /scripts/
ADD sample.pdf .
CMD [ "python", "./scripts/runner2.py" ]
cat runner2.py:
#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])
[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 14:40:23,183 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 14:41:00,480 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 14:41:02,324 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 14:41:02,324 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
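From the tika-python GitHub page:
To use this library, you need to have Java 7+ installed on your system, as tika-python starts up the Tika REST server in the background.
So you need Java, but the python:3 image does not include it.
There are a few possible solutions:
- Find a Docker image that has both Python and Tika installed
- Use separate images
- Install Java manually on top of python:3 by adding the Java install commands to your Dockerfile
- Install Python on top of a Java image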
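Whichever option you pick, it may help to confirm that a java binary is actually visible inside the container before tika tries to start its server. The following is only a minimal sketch using the standard library; `java_available` and `java_version` are hypothetical helper names, not part of tika-python.

```python
import shutil
import subprocess


def java_available(path=None):
    """Return True if a `java` executable can be found on PATH.

    `path` optionally overrides the search path (useful for testing).
    """
    return shutil.which("java", path=path) is not None


def java_version():
    """Return the text printed by `java -version`, or None if java is missing."""
    if not java_available():
        return None
    # `java -version` normally writes to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    print("java found:", java_available())
    print(java_version())
```

Running this as the container's command (in place of runner.py) should show quickly whether the "Unable to run java" error comes from a missing JRE or from something else.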
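I don't have enough reputation to comment, so I'm posting here.
It looks like your Dockerfile is now a multi-stage build, and Java is not in the final stage: the earlier stage is discarded.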
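To make that concrete: each FROM line starts a new stage, and only the last stage ends up in the image you run. A rough sketch of that rule in Python (the `final_stage_base` helper is made up for illustration and is not how Docker itself parses Dockerfiles):

```python
def final_stage_base(dockerfile_text):
    """Return the base image named by the last FROM instruction, or None."""
    base = None
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        # Instruction names are case-insensitive; comments and blanks are skipped.
        if len(parts) >= 2 and parts[0].upper() == "FROM":
            # Ignore flags such as --platform=... that may precede the image name.
            args = [p for p in parts[1:] if not p.startswith("--")]
            if args:
                base = args[0]
    return base


dockerfile = """\
### 1. Get Linux
FROM alpine:3.7
RUN apk add --no-cache openjdk8-jre
### 3. Get Python
FROM python:3
RUN pip3 install tika
"""

print(final_stage_base(dockerfile))  # python:3
```

The Java installed under alpine:3.7 lives in the first stage only, so the container started from the python:3 stage never sees it.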
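As Giga Kokaia and others said earlier, Java is the problem. It looks like you want to do this with a single Dockerfile. That can be done, for example, by using Alpine as the base image, but you need some extra dependencies to install Python and the required pip packages. Alpine may not be the best base for Python when many libraries are involved, because it uses musl libc rather than glibc. Still, here is a very roughly updated Dockerfile: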
### 1. Get Linux
FROM alpine:3.7
### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre \
&& apk add python3 python3-dev gcc g++ gfortran musl-dev libxml2-dev libxslt-dev
ENV JAVA_HOME=/opt/java/openjdk \
PATH="/opt/java/openjdk/bin:$PATH"
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx wheel tika numpy
RUN pip3 install pandas
RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output
ADD runner2.py /scripts/
ADD sample.pdf .
CMD [ "python3", "./scripts/runner2.py" ]
I am reposting @anapaulagomes's comment as an answer, because it is what I was searching Google for: running Tika as a Docker container.
I managed to solve this by using Tika as a separate service (which had
better performance than having it in the same image). But instead of
running Tika's jar, I consume its API. You only need to configure the
environment variables TIKA_CLIENT_ONLY: 1 and
TIKA_SERVER_ENDPOINT: tika:9998.
You can see the Dockerfile and docker-compose.yml here:
https://github.com/DadosAbertosDeFeira/maria-quiteria
You can start the Tika service with:
docker run --rm -t -d --name my_tika --net my-network \
-p 9998:9998 apache/tika:1.27
Or add this to your docker-compose.yml:
tika:
  image: apache/tika
  ports:
    - "9998:9998"
This lets you call from tika import parser and parse documents without calling tika.initVM().
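On the Python side, a client-only setup might look roughly like the sketch below. The helper names `configure_tika_client` and `parse_file` are made up for illustration, and the endpoint `http://tika:9998` is an assumption: it presumes the client container can reach the server under the compose service name `tika` (with the docker run command above it would be `my_tika`), and that a full URL with scheme is wanted. As far as I know, tika-python reads these environment variables when it is imported, so they are set first.

```python
import os


def configure_tika_client(endpoint="http://tika:9998"):
    """Point tika-python at an already running Tika server.

    Call this before importing `tika`, so the library does not try to
    download the jar and start a local Java process.
    """
    os.environ["TIKA_CLIENT_ONLY"] = "1"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    return endpoint


def parse_file(path, endpoint="http://tika:9998"):
    """Send a file to the remote Tika server and return the parsed dict."""
    configure_tika_client(endpoint)
    # Imported late on purpose, after the environment has been set.
    from tika import parser
    # serverEndpoint is also passed explicitly, so the call does not rely
    # solely on the environment variable being picked up.
    return parser.from_file(path, serverEndpoint=endpoint)


if __name__ == "__main__":
    parsed = parse_file("sample.pdf")
    print(parsed["metadata"])
    print(parsed["content"])
```

With this arrangement the Python image no longer needs Java at all; only the apache/tika container does.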