How to configure the Solr DataImportHandler to parse a Wikipedia XML document?
This is what I have done so far.
I added a request handler to solrconfig.xml as follows:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">wiki-data-config.xml</str>
</lst>
</requestHandler>
In the same configuration directory, I created a file wiki-data-config.xml with the following content:
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8" />
<document>
<entity name="page"
pk="id"
processor="XPathEntityProcessor"
stream="true"
forEach="/mediawiki/page/"
url="/home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml"
flatten="true" >
<field column="id" xpath="/mediawiki/page/id" />
<field column="title" xpath="/mediawiki/page/title" />
<field column="revision" xpath="/mediawiki/page/revision/id" />
<field column="user" xpath="/mediawiki/page/revision/contributor/username" />
<field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
<field column="text" xpath="/mediawiki/page/revision/text" />
<field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
<field column="$skipDoc" regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
</entity>
</document>
</dataConfig>
My schema.xml contains the following:
<!-- Tanny edit starts -->
<field name="id" type="int" indexed="true" stored="true" required="true"/>
<field name="title" type="string" indexed="true" stored="false"/>
<field name="revision" type="int" indexed="true" stored="true"/>
<field name="user" type="string" indexed="true" stored="true"/>
<field name="userId" type="int" indexed="true" stored="true"/>
<field name="text" type="text_en" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true"/>
<field name="titleText" type="text_en" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>
<copyField source="title" dest="titleText"/>
<!-- Tanny edit ends -->
After restarting Solr, I tried to post the WikiMedia XML data using the ./bin/post script as follows:
tanny@localhost:~/binaries/solr-5.2.1$ ./bin/post -c core-base-wiki /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml
and the following was printed on the console:
/usr/lib/jvm/java-7-oracle-cloudera//bin/java -classpath /home/tanny/binaries/solr-5.2.1/dist/solr-core-5.2.1.jar -Dauto=yes -Dc=core-base-wiki -Ddata=files org.apache.solr.util.SimplePostTool /home/tanny/Downloads/Data/Wiki/enwiki-20150702-stub-articles8.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/core-base-wiki/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file enwiki-20150702-stub-articles8.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/core-base-wiki/update...
Time spent: 0:00:00.863
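However, when I go to the UI and look at the overview, it shows that 0 documents have been indexed.
I don't know which configuration I am missing. Any help/guidance would be appreciated.
P.S.: The dataset enwiki-20150702-stub-articles8.xml was downloaded from the WikiMedia page. A few sample lines from the document are shown below: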
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.26wmf11</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
...
...
<namespace key="2600" case="first-letter">Topic</namespace>
</namespaces>
</siteinfo>
<page>
<title>700 (number)</title>
<ns>0</ns>
<id>465001</id>
<revision>
<id>663854862</id>
<parentid>655386821</parentid>
<timestamp>2015-05-24T21:01:24Z</timestamp>
<contributor>
<username>Cnwilliams</username>
<id>10190671</id>
</contributor>
<comment>Disambiguated: [[Tintin]] → [[The Adventures of Tintin]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text id="669059875" bytes="12464" />
<sha1>q15fslnvlsrgbeo8f6mcyrg00l2d2a5</sha1>
</revision>
</page>
<page>
<title>Canadian federal election, 1957</title>
<ns>0</ns>
<id>465004</id>
<revision>
<id>666418811</id>
<parentid>666417048</parentid>
<timestamp>2015-06-11T01:38:05Z</timestamp>
<contributor>
<username>Wehwalt</username>
<id>458237</id>
</contributor>
<comment>/* Impact */ clarify</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text id="671713242" bytes="77788" />
<sha1>05g14m9sfavo7buuirpr8lx4c6vfwee</sha1>
</revision>
</page>
...
...
<page>
<title>Professional Players Tournament (snooker)</title>
<ns>0</ns>
<id>665001</id>
<redirect title="World Open (snooker)" />
<revision>
<id>359952698</id>
<parentid>25566787</parentid>
<timestamp>2010-05-03T23:48:34Z</timestamp>
<contributor>
<username>Xqbot</username>
<id>8066546</id>
</contributor>
<minor/>
<comment>Robot: Fixing double redirect to [[World Open (snooker)]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text id="360810125" bytes="34" />
<sha1>lxtjwcda9vk58fphj8ie2logjm607mv</sha1>
</revision>
</page>
</mediawiki>
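The data got indexed once I triggered the import with the command "curl http://localhost:8983/solr/core-base-wiki/dataimport?command=full-import".
Somehow ./bin/post is not able to do the same thing. I have not researched this any further; if anyone else knows how to do it, please share your findings.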
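A likely explanation, based on the console output in the question: ./bin/post sent the file to the core's /update handler, which for XML expects Solr's own <add><doc> update format rather than a MediaWiki dump, so nothing in the file was recognized as a document. The DataImportHandler only runs when its own endpoint (/dataimport here) is called, and it then reads the file named in the url attribute of wiki-data-config.xml by itself. Below is a minimal sketch of triggering an import and checking its progress from the shell. The helper names dih_url and run_import and the RUN_IMPORT switch are made up for illustration; the host, port, and core name are the ones from the question.

```shell
#!/bin/sh
# Values taken from the question; adjust them for your own setup.
SOLR_URL="http://localhost:8983/solr"
CORE="core-base-wiki"

# Build the DataImportHandler URL for a given command (full-import, status, ...).
# dih_url is a hypothetical helper, not part of Solr.
dih_url() {
    printf '%s/%s/dataimport?command=%s' "$SOLR_URL" "$CORE" "$1"
}

# Start a full import, then ask the handler how far it has got.
# full-import returns immediately; the import itself runs in the background.
run_import() {
    curl -s "$(dih_url full-import)" && curl -s "$(dih_url status)"
}

# Only contact Solr when explicitly asked to: RUN_IMPORT=1 sh import.sh
if [ "${RUN_IMPORT:-0}" = "1" ]; then
    run_import
fi
```

Because the import is asynchronous, the status command may need to be repeated until it reports that the import has finished; the document count in the UI overview should change only after that.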
You are missing the lib element in solrconfig.xml:
<lib dir="../../../dist" regex="solr-dataimporthandler-.*\.jar" />
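Before relying on that lib line, it may help to confirm that the jar it is meant to match is really there. The sketch below assumes the standard layout of a Solr 5.x download, where the DataImportHandler jars sit in the dist directory of the install, and that the relative dir in the lib element resolves to that directory from the core's instance directory. find_dih_jars is a hypothetical helper, and its shell glob only approximates the regex used in the lib element.

```shell
#!/bin/sh
# Hypothetical helper: print the DataImportHandler jars found under <install>/dist.
# $1 is the Solr install directory, e.g. ~/binaries/solr-5.2.1 in the question.
find_dih_jars() {
    for jar in "$1"/dist/solr-dataimporthandler-*.jar; do
        # If nothing matches, the pattern stays literal and this test fails,
        # so no path is printed.
        [ -f "$jar" ] && printf '%s\n' "$jar"
    done
    return 0
}
```

Calling find_dih_jars "$HOME/binaries/solr-5.2.1" and getting no output would suggest that the lib element cannot load anything from that path, in which case the dir attribute needs to point elsewhere.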