Tika-Python library throws read timeout error for large Word documents

Question

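I am trying to parse Word documents with Tika using the Tika-Python library (https://github.com/chrismattmann/tika-python) on Python 2.7 (I know it is deprecated, but a few of my other dependencies only work on Python 2). For some larger documents, however, I cannot get the parsed data back. I am using the code snippet below to parse the document.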

headers = {
    "X-Tika-OCRLanguage": "eng",
    "timeout": 300,
    "pool_timeout": 300,
    "X-Tika-OCRTimeout": 300
}
text_tika = parser.from_file(doc, xmlContent=False, requestOptions={'headers': headers})

This snippet raises the following error:

ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='localhost', port=9998): Read timed out. (read timeout=60)",),)

I have tried various request options to increase the read timeout, but none of them worked. Can anyone help?

Answer 1

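I found the problem, thanks to the repository owner @chrismattmann, who pointed out that the timeout must be passed as its own key in requestOptions, outside the headers dictionary. The code above should work like this: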

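from tika import parser

headers = {
    "X-Tika-OCRLanguage": "eng",
    "X-Tika-OCRTimeout": "300"
}
# 'timeout' is a request option, so it sits next to 'headers', not inside it.
text_tika = parser.from_file(doc, xmlContent=False, requestOptions={'headers': headers, 'timeout': 300})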
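
To make the fix harder to get wrong, the options can be built in one place. This is only a minimal sketch: the helper names build_request_options and parse_document are made up for illustration and are not part of Tika-Python, and the 300-second defaults simply mirror the values used in the question. It keeps 'timeout' next to 'headers' and converts the OCR timeout to a string, as in the corrected snippet above.

```python
def build_request_options(read_timeout_s=300, ocr_language="eng", ocr_timeout_s=300):
    """Build a requestOptions dict (hypothetical helper, not a Tika-Python API)."""
    headers = {
        "X-Tika-OCRLanguage": ocr_language,
        # Header values are sent as text, so convert the number explicitly.
        "X-Tika-OCRTimeout": str(ocr_timeout_s),
    }
    # 'timeout' is a sibling of 'headers'; placing it inside 'headers'
    # left the 60-second read timeout from the error message in effect.
    return {"headers": headers, "timeout": read_timeout_s}


def parse_document(path, read_timeout_s=300):
    """Parse a file with a longer read timeout (hypothetical wrapper)."""
    # Imported here so build_request_options can be used without tika installed.
    from tika import parser
    options = build_request_options(read_timeout_s=read_timeout_s)
    return parser.from_file(path, xmlContent=False, requestOptions=options)
```

Calling parse_document("large.docx", read_timeout_s=600) would then wait up to ten minutes for the Tika server to respond; the file name is a placeholder.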

python

apache-tika