Extract text content from Tika without specifying the file header

Question

Is there a way to extract content from a file with the Tika server without explicitly defining the header? For example, for a particular file named "file.pdf", if I run

curl -X PUT --data-binary @file.pdf localhost:9998/tika --header "Content-type: application/pdf" > file.txt

I get the extracted content in "file.txt", but if I omit

' --header "Content-type: application/pdf" '

I get an empty "file.txt".

More generally, is there a way to submit a document to the Tika server and extract its content as plain text automatically, with a single command?

Alternatively, how can I use a pipe to feed the header that Tika would likely assign to the file into the command at the start of this question?

Many thanks to the community!

Answer 1

You are calling the Tika server in a way that prevents auto-detection. As described on the Tika Server wiki page, to extract plain text from any file (PDF included), you should run curl as:

curl -T file.pdf http://localhost:9998/tika --header "Accept: text/plain"

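You need an Accept header to tell Tika which format you want the result in (plain text or HTML for text extraction; more formats are available for metadata). As long as you send the file directly with the -T option, its type will be auto-detected for you.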
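
To apply the same call to many files, the command above can be wrapped in a small shell function. This is only a sketch: the function name tika_extract, the TIKA_URL variable, and the choice to write the output next to the input as "<file>.txt" are my own assumptions, not part of Tika or curl. As far as I know, the original command came back empty because --data-binary makes curl send "Content-Type: application/x-www-form-urlencoded" by default, which the server takes as the file type, while -T sends no content type of its own.

```shell
#!/bin/sh
# Hypothetical helper: send one file to a Tika server and save the plain text.
# TIKA_URL is an assumed default; change it to wherever your server listens.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"

tika_extract() {
    # $1: input file; the extracted text is written to "$1.txt".
    # -T uploads the file with PUT and adds no Content-Type header,
    # leaving the server to detect the type from the content.
    curl -s -T "$1" "$TIKA_URL" --header "Accept: text/plain" > "$1.txt"
}
```

With the function defined, a loop such as `for f in *.pdf; do tika_extract "$f"; done` submits each file in turn, with no per-file Content-Type needed.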
