How to get webgraph in Apache Nutch 2.x
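I am using Apache Nutch 2.3.1 to crawl some websites. I need to build the webgraph of the crawled data, but unfortunately this version does not define a class for it the way version 1.x does. Can anyone guide me?
Below is the complete list of command-line options for version 2.3.1 (there is no webgraph command):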
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
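Unfortunately, this feature has not been added to the 2.x branch of Nutch. As a general rule, I believe the 1.x branch has more features and performs better (although this is changing). If you need to stay on 2.x, I suggest that you either implement the feature yourself or port the links-indexer plugin from 1.x to 2.x (I believe porting the indexer plugin would be the easier of the two). I had planned to do this myself but could not find the time.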
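If you go the "implement it yourself" route, the core of a webgraph is small: from the outlinks found on each page, derive the inverted (inlink) view. The following is only a rough sketch of that idea in plain Python and does not use any Nutch API. It assumes you have already exported (source URL, target URL) pairs from your 2.x storage backend by some means of your own; the function names and the sample URLs are made up for illustration.

```python
from collections import defaultdict


def build_webgraph(edges):
    """Build outlink and inlink maps from (source_url, target_url) pairs.

    `edges` is assumed to be exported from the crawl store by your own
    tooling; this function knows nothing about Nutch itself.
    """
    outlinks = defaultdict(set)
    inlinks = defaultdict(set)
    for source, target in edges:
        if source == target:
            continue  # skip self-links, they add nothing to the graph
        outlinks[source].add(target)
        inlinks[target].add(source)
    return outlinks, inlinks


def inlink_counts(inlinks):
    """Number of distinct pages linking to each URL."""
    return {url: len(sources) for url, sources in inlinks.items()}


# Hypothetical sample data, not real crawl output.
edges = [
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://c.example/"),  # self-link, ignored
]

outlinks, inlinks = build_webgraph(edges)
counts = inlink_counts(inlinks)
for url in sorted(counts):
    print(url, counts[url])
```

Sets are used so that repeated links between the same two pages count once. For a large crawl you would want to do the same inversion as a distributed job rather than in memory.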