"zip bomb" 向 Solr 发送 HTML 文档时出现异常

"zip bomb" exception while sending HTML document to Solr

I'm sending HTML documents to Solr, and Tika is throwing a "Zip bomb detected!" exception. The Solr log reports: "Suspected zip bomb: 100 levels of XML element nesting"

Looking at the Tika source code, there is an arbitrary limit of 100 levels of XML element nesting (see here).

The variable in question (maxDepth) does have a public setter, but I'm not sure whether it can be set from Solr. Is that possible?
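To make the limit concrete, here is a minimal sketch of the kind of nesting check that the log message describes. It uses only the JDK's SAX parser and is not Tika's actual code; the class name DepthLimitDemo, the handler, and the helper methods are made up for illustration, and I'm assuming the check fails as soon as the nesting depth reaches the limit.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DepthLimitDemo {

    /** Hypothetical handler: fails once element nesting reaches maxDepth. */
    static class DepthLimitHandler extends DefaultHandler {
        private final int maxDepth;
        private int depth = 0;

        DepthLimitHandler(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            depth++;
            if (depth >= maxDepth) {
                // Assumption: mirrors the wording of the message seen in the Solr log.
                throw new SAXException(
                        "Suspected zip bomb: " + depth + " levels of XML element nesting");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** Builds a well-formed document with <div> elements nested `levels` deep. */
    static String nested(int levels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < levels; i++) sb.append("<div>");
        for (int i = 0; i < levels; i++) sb.append("</div>");
        return sb.toString();
    }

    /** Returns true if the document parses without hitting the depth limit. */
    static boolean parsesWithin(String xml, int maxDepth) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DepthLimitHandler(maxDepth));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsesWithin(nested(50), 100));   // shallow document
        System.out.println(parsesWithin(nested(150), 100));  // deeply nested document
    }
}
```

A page with deeply nested markup, such as layout tables inside layout tables, could plausibly trip a check like this without being malicious, which is why I'd like to raise the limit.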

Here is the full stack trace:

2018-04-05 16:47:48.034 ERROR (qtp1654589030-15) [   x:aconn] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at ca.calgary.csc.wds.solr.GsaAconnRequestHandler.handleRequestBody(GsaAconnRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    at org.eclipse.jetty.io.SelectChannelEndPoint.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:138)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    ... 35 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:234)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:255)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:297)
    at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:251)
    at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:167)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:60)
    at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
    at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
    at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:625)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:135)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    ... 36 more

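Edit: I found a Jira issue that appears to have a similar cause. The solution given by Tim Allison is to use Tika's default HTML mapper instead of Solr's mapper. How can I set this in the Solr configuration?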

Edit2: I've verified that this isn't a Tika problem, since the tika-app jar extracts the file's content successfully:

>java -jar tika-app-1.16.jar -t test.html

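According to Tim, this can't be set through the Solr configuration. As an alternative, the suggestion I found mentioned elsewhere is to run Tika outside of Solr, i.e. not to use Solr Cell.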
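A rough sketch of what running Tika outside of Solr could look like, using only the JDK: call the tika-app jar as a subprocess to get plain text, then post that text to Solr's regular update handler as JSON. Several things here are assumptions rather than anything confirmed above: the class name ExternalTikaIndexer, the Solr URL on localhost:8983, a text field named content in the schema, and using the file name as the document id. The core name aconn is taken from the log line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ExternalTikaIndexer {

    /** Runs tika-app with -t (plain-text output) and returns what it prints. */
    static String extractText(String tikaJar, String file)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "java", "-jar", tikaJar, "--encoding=UTF-8", "-t", file)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text;
        try (InputStream in = p.getInputStream()) {
            text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        if (p.waitFor() != 0) {
            throw new IOException("tika-app exited with an error for " + file);
        }
        return text;
    }

    /** Quotes and escapes a string as a JSON string literal. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Builds a one-document JSON update body; "content" is an assumed field name. */
    static String updateBody(String id, String text) {
        return "[{\"id\":" + jsonString(id) + ",\"content\":" + jsonString(text) + "}]";
    }

    /** Posts the JSON body to Solr's update handler and returns the HTTP status. */
    static int post(String updateUrl, String body) throws IOException {
        HttpURLConnection c = (HttpURLConnection) new URL(updateUrl).openConnection();
        c.setRequestMethod("POST");
        c.setDoOutput(true);
        c.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = c.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return c.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        String text = extractText("tika-app-1.16.jar", "test.html");
        // Assumed URL: default local Solr port, core "aconn" as seen in the log.
        int status = post("http://localhost:8983/solr/aconn/update?commit=true",
                updateBody("test.html", text));
        System.out.println("Solr responded with HTTP " + status);
    }
}
```

Spawning a JVM per document is slow, so for anything beyond a handful of files it would probably make more sense to embed Tika as a library or talk to a long-running Tika server instead. The point of the sketch is only that the extraction step and the indexing step are separated, so Solr Cell and its HTML mapper are never involved.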