How to set TIKA_SERVER_ENDPOINT from the tika-python lib
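The excellent tika-python lib says in its documentation at https://github.com/chrismattmann/tika-python that tika_server.jar can be configured locally, so it is not downloaded every time the algorithm is used. Has anyone done this, and could you post the configuration?
The first time the algorithm runs, it downloads tika_server.jar so the lib can use it. I would like to avoid this download by pointing the lib at a local file.
Extracting text from a PDF: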
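import tika
tika.TikaClientOnly = True  # client only: rely on an already running Tika server
from tika import parser

# Extract text and metadata from a PDF; returns [content, metadata].
def extraiPDF(f):
    resultado = []
    raw = parser.from_file(f)
    metadados = raw["metadata"]
    conteudo = raw["content"] or ""  # content is None when no text was extracted
    # Drop line breaks and backslashes, turn tabs into spaces.
    conteudo = conteudo.replace('\r\n', '').replace('\n', '').replace('\r', '').replace('\\', '').replace('\t', ' ')
    resultado.append(conteudo)
    resultado.append(metadados)
    return resultado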
To download the jar once and then start the Tika server, run this bash script.
#!/bin/bash
TIKA_PORT=9998
TIKA_HOST=localhost
CURRENT_USER=$(whoami)
TIKA_JAR_URL="http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar"
TIKA_WORKSPACE="$HOME/tika"
TIKA_FILE_NAME="tika_server.jar"
echo -e "Current user: $CURRENT_USER"
# Download the jar only if it is not already in the workspace.
if [ ! -f "$TIKA_WORKSPACE/$TIKA_FILE_NAME" ]; then
    echo -e "Downloading tika-server.jar"
    if [ ! -d "$TIKA_WORKSPACE" ]; then
        echo -e "making tika workspace"
        mkdir -p "$TIKA_WORKSPACE"
    fi
    wget -c "$TIKA_JAR_URL" -O "$TIKA_WORKSPACE/$TIKA_FILE_NAME"
fi
echo -e "## Setting environment vars"
export TIKA_SERVER_ENDPOINT="http://$TIKA_HOST:$TIKA_PORT"
echo -e "TIKA_SERVER_ENDPOINT to $TIKA_SERVER_ENDPOINT"
export TIKA_CLIENT_ONLY=True
echo -e "TIKA_CLIENT_ONLY to $TIKA_CLIENT_ONLY"
echo -e "## Starting tika server on: $TIKA_WORKSPACE"
cd "$TIKA_WORKSPACE" || exit 1
java -jar "$TIKA_FILE_NAME" -h "$TIKA_HOST" -p "$TIKA_PORT"
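One caveat: the `export` lines in the script only affect that shell and the processes it starts, so a Python process launched from somewhere else will not see them. A minimal sketch of setting the same two variables from the Python side is below. It assumes the server from the script is already listening on localhost:9998, and it relies on my understanding that tika-python reads these variables when it is first imported, which is why the import happens after they are set. The helper names `configure_tika_client` and `extract_text` are made up for illustration; they are not part of the library.

```python
import os


def configure_tika_client(host="localhost", port=9998):
    """Point tika-python at an already running Tika server (hypothetical helper)."""
    endpoint = f"http://{host}:{port}"
    # Assumption: tika-python picks these up at import time,
    # so this must run before the first `import tika`.
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    return endpoint


def extract_text(path):
    """Parse a file through the configured server; returns (content, metadata)."""
    # Imported here, after the environment variables are in place.
    from tika import parser

    parsed = parser.from_file(path)
    return parsed.get("content"), parsed.get("metadata")
```

Call `configure_tika_client()` once at program start, then `extract_text("some.pdf")` as needed. If the server is not running, the call should fail rather than fall back to downloading the jar, which is the behaviour asked for in the question.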