Importing files with Solr Cell/Tika metadata causes a multiple value error
I'm trying to index documents with Solr Cell and Tika on Solr 5.4.1. I'm using the default configuration, but when I import my documents I get this error:
multiple values encountered for non multiValued field meta:
Here is the log related to the error; it also shows the data I'm sending to Solr.
125973 INFO (qtp840863278-17) [ x:fusearchiver] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,uprefix=ignored_,captureAttr=true,fmap.a=links,fmap.div=ignored_}}}
127134 INFO (qtp840863278-17) [ x:fusearchiver] o.a.s.u.p.LogUpdateProcessorFactory [fusearchiver] webapp=/solr path=/update/extract params={literal.archiveDate_dt=Mon+Apr+03+21:16:48+EDT+2017&literal._accountId=2&literal.categories=taxes&literal.categories=5498&literal.id=b5701a36-0dec-4746-bb5d-3c307a557cd7&literal._batchId=25&literal._type=document&literal._filename=2016-0664-Form-5498.pdf&literal._employeeNumber=1411&wt=javabin&literal._employeeFuseId=1&literal.effectiveDate_dt=Sat+Dec+31+00:00:00+EST+2016&literal._json={"accountId":2,"archiveDate":1491268608431,"batchId":25,"categories":["taxes","5498"],"effectiveDate":1483160400000,"employeeFuseId":1,"employeeNumber":"1411","fileName":"2016-0664-Form-5498.pdf","id":"b5701a36-0dec-4746-bb5d-3c307a557cd7","imageUrl":null,"path":"2016-0664-Form-5498.pdf","uploadedBy":null,"url":null}&version=2} {} 0 1161
127135 ERROR (qtp840863278-17) [ x:fusearchiver] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=b5701a36-0dec-4746-bb5d-3c307a557cd7] multiple values encountered for non multiValued field meta: [dcterms:modified, 2017-03-16T23:14:41Z, meta:creation-date, 2017-03-16T23:14:41Z, meta:save-date, 2017-03-16T23:14:41Z, pdf:PDFVersion, 1.4, dcterms:created, 2017-03-16T23:14:41Z, Last-Modified, 2017-03-16T23:14:41Z, date, 2017-03-16T23:14:41Z, X-Parsed-By, org.apache.tika.parser.DefaultParser, X-Parsed-By, org.apache.tika.parser.pdf.PDFParser, modified, 2017-03-16T23:14:41Z, xmpTPg:NPages, 2, Creation-Date, 2017-03-16T23:14:41Z, pdf:encrypted, false, created, Thu Mar 16 23:14:41 UTC 2017, stream_size, null, dc:format, application/pdf; version=1.4, producer, Ricoh Americas Corporation, AFP2PDF, Content-Type, application/pdf, xmp:CreatorTool, Ricoh Americas Corporation, AFP2PDF Plus Version: 1.014.10, Last-Save-Date, 2017-03-16T23:14:41Z]
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:92)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:83)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:273)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:207)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:49)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:924)
at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1079)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:702)
at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:126)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:131)
at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:237)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:70)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
Here is the extraction section of my solrconfig.xml:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
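I assumed this would mark everything that isn't a declared field as ignored, so the metadata shouldn't be imported. I searched my Solr schema and I don't declare a meta field, so I expected Solr Cell to throw it away.
I'm importing the documents with SolrJ. I'm also adding a number of literals to each document; you can see the data I pass as literals in the log above.
Why am I seeing this error?
Can I fix this by having it only extract the content, put it into the text field, and treat it the same way it treats HTML?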
The fix was to add the following to the extracting request handler configuration in solrconfig.xml:
<str name="fmap.meta">ignored_</str>
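I don't know why I had to do this explicitly. I also had to set lowernames = false because my literals were being altered, which caused me serious problems. This has convinced me that I should run Tika outside of Solr, since that gives me more control. I eventually want to add Tesseract, and doing it myself seems easier.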
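To illustrate why lowernames can alter literals: the Solr Cell documentation describes lowernames=true as mapping field names to lowercase with underscores, for example "Content-Type" becomes "content_type". The sketch below is only an approximation of that documented rule, not Solr's actual code, and the class and method names (LowerNamesSketch, mapName) are made up for illustration. It also assumes the same rule is applied to the names passed as literal.* parameters, which would match the behaviour described above.

```java
public class LowerNamesSketch {

    // Approximation of the documented lowernames rule (an assumption, not
    // Solr's source): letters and digits are lowercased, every other
    // character is replaced with an underscore.
    static String mapName(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            sb.append(Character.isLetterOrDigit(ch) ? Character.toLowerCase(ch) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Field names taken from the request parameters in the log above,
        // plus the example from the Solr documentation.
        String[] names = {"Content-Type", "archiveDate_dt", "_accountId", "_employeeFuseId"};
        for (String n : names) {
            System.out.println(n + " -> " + mapName(n));
        }
        // Content-Type -> content_type
        // archiveDate_dt -> archivedate_dt
        // _accountId -> _accountid
        // _employeeFuseId -> _employeefuseid
    }
}
```

Under that assumption, a schema that declares camelCase fields such as _accountId would no longer match the lowercased names, which is one plausible reason for having to turn lowernames off.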