Indexing nutch crawled data in "Bluemix" solr

Question

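I'm trying to index nutch crawled data in Bluemix solr, and I cannot find any way to do it. My main question is: can anybody help me do this? What should I do to send the results of my nutch crawl to my Bluemix Solr? For crawling I used nutch 1.11. Here is part of what I have done so far and the problems I have faced. I thought there might be two possible solutions: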

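Via the nutch command:

NUTCH_PATH/bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/ -Dsolr.server.url="OURSOLRURL"

With this I can index the nutch crawled data in OURSOLR. However, I ran into some problems.

a- Odd as it sounds, it would not accept the URL. I could work around that by using the URL-encoded form of the URL instead.

b- Since I have to connect with a specific username and password, nutch could not connect to my solr. Considering this: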

 Active IndexWriters :
 SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

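which appears in the command-line output, I tried to handle this by adding the authentication parameters solr.auth=true solr.auth.username="SOLR-UserName" solr.auth.password="Pass" to the command.

So up to now I have arrived at this command:

bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016* solr.server.url="https%3A%2F%2Fgateway.watsonplatform.net%2Fretrieve-and-rank%2Fapi%2Fv1%2Fsolr_clusters%2FCLUSTER-ID%2Fsolr%2Fadmin%2Fcollections" solr.auth=true solr.auth.username="USERNAME" solr.auth.password="PASS"

But for some reason I have not figured out yet, the command treats the authentication parameters as crawled data directories and does not work. So I guess this is not the right way to use the "Active IndexWriters" parameters. Can anybody tell me how I should do it?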

Via the curl command:

curl -X POST -H "Content-Type: application/json" -u "BLUEMIXSOLR-USERNAME":"BLUEMIXSOLR-PASS" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTERS-ID/solr/example_collection/update" --data-binary @{/path_to_file}/FILE.json

I thought maybe I could feed it the json files created by this command:

bin/nutch commoncrawldump -outputDir finalcrawlResult/ -segment crawl/segments -gzip -extension json -SimpleDateFormat -epochFilename -jsonArray -reverseKey

but there are some problems here.

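a. This command produces a great many files in complicated paths, and it would take a long time to post all of them manually. I guess for big crawls it may even be impossible. Is there any way to POST all the files in a directory and its subdirectories at once with just one command?

b. There is a weird name "ÙÙ÷yœ" at the start of the json files created by commoncrawldump.

c. I removed the weird name and tried to POST just one of these files, but here is the result: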

 {"responseHeader":{"status":400,"QTime":23},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Unknown command 'url' at [9]","code":400}}

Does this mean these files cannot be fed to Bluemix solr and are useless for me?

Answer 1

Thanks to Lewis John Mcgibbney, I realized that the index tool should be used as follows:

bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" Crawl/crawldb -linkdb Crawl/linkdb Crawl/segments/2016*

In other words: put -D before each of the Solr and auth parameters, and place these parameters before the tool's own arguments.
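For anyone driving the tool from a script, the ordering rule above can be sketched in Python. This is only an illustration: `build_index_command` and its parameters are hypothetical names, not part of nutch, and the paths and credentials are placeholders.

```python
import glob
import subprocess


def build_index_command(solr_url, username, password, segments,
                        crawl_dir="crawl", nutch_bin="bin/nutch"):
    """Return the argument list for `nutch index`, with every -D property
    placed before the tool's own positional arguments."""
    properties = {
        "solr.server.url": solr_url,
        "solr.auth": "true",
        "solr.auth.username": username,
        "solr.auth.password": password,
    }
    cmd = [nutch_bin, "index"]
    for key, value in properties.items():
        cmd += ["-D", key + "=" + value]
    # Positional arguments come last: crawldb, linkdb, then the segments.
    cmd += [crawl_dir + "/crawldb", "-linkdb", crawl_dir + "/linkdb"]
    cmd += list(segments)
    return cmd


# Example usage (placeholders; not executed here):
# segments = sorted(glob.glob("crawl/segments/2016*"))
# subprocess.call(build_index_command(SOLR_URL, "USERNAME", "PASS", segments))
```

Passing a list to `subprocess.call` avoids shell quoting problems with the URL, but it also means no shell expands `2016*`, which is why the usage comment expands the segments with `glob.glob` first.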

Answer 2

To index nutch crawled data in the Bluemix Retrieve and Rank service, one should:

Crawl the seeds with nutch, e.g.

$:bin/crawl -w 5 urls crawl 25

You can check the status of the crawl with:

bin/nutch readdb crawl/crawldb/ -stats

Dump the crawled data as files:

$:bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/

Post those that can be posted directly, e.g. xml files, to the solr collection in Retrieve and Rank:

import subprocess

# solr_cluster_id, solr_collection_name, Cont_type_xml, solr_credentials and myfilename are assumed to be defined earlier
Post_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update"' %(solr_cluster_id, solr_collection_name)
cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_xml, solr_credentials, Post_url, myfilename)
subprocess.call(cmd,shell=True)
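To post a whole dump directory rather than one file at a time, the same request can be repeated over every file found under a root folder. The following is a minimal sketch using only the standard library instead of curl. The function names are hypothetical, and it assumes each xml file is already in a form that the collection's /update handler accepts.

```python
import base64
import os
import urllib.request


def find_files(root, extension=".xml"):
    """Yield paths of all files under root (including subdirectories)
    whose name ends with the given extension."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extension):
                yield os.path.join(dirpath, name)


def build_update_request(post_url, username, password, path,
                         content_type="application/xml"):
    """Build a POST request carrying the file body and HTTP Basic auth."""
    with open(path, "rb") as handle:
        body = handle.read()
    token = base64.b64encode((username + ":" + password).encode("utf-8"))
    headers = {
        "Content-Type": content_type,
        "Authorization": "Basic " + token.decode("ascii"),
    }
    return urllib.request.Request(post_url, data=body, headers=headers,
                                  method="POST")


def post_directory(root, post_url, username, password):
    """POST every matching file under root and print the HTTP status."""
    for path in find_files(root):
        request = build_update_request(post_url, username, password, path)
        with urllib.request.urlopen(request) as response:
            print(path, response.status)
```

Calling `post_directory("dumpData/", post_url, "USERNAME", "PASS")` with the update URL of the collection would then send the files one by one; `urlopen` raises an exception on a 4xx or 5xx response, so a rejected file stops the loop.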

Convert the rest to json using the Bluemix Doc-Conv service:

doc_conv_url = '"https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document?version=2015-12-15"'
cmd ='''curl -X POST -u %s -F config="{\"conversion_target\":\"answer_units\"}" -F file=@%s %s''' %(doc_conv_credentials, myfilename, doc_conv_url)
process = subprocess.Popen(cmd, shell= True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

Then save these JSON results in a json file.
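The saving step is not shown in the code above. One possible way to do it, assuming the conversion call printed a single JSON document on stdout, is sketched here; `save_conversion_output` is a hypothetical helper name.

```python
import json


def save_conversion_output(raw_output, json_path):
    """Parse the bytes that curl wrote to stdout and save them as a JSON file.

    Raises ValueError (json.JSONDecodeError) if the service returned
    something that is not valid JSON, e.g. an error page.
    """
    document = json.loads(raw_output.decode("utf-8"))
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, ensure_ascii=False, indent=2)
    return document


# Example usage with the Popen object created above (not executed here):
# stdout, stderr = process.communicate()
# save_conversion_output(stdout, Path_jsonFile)
```

Parsing before writing means a failed conversion is caught at this point rather than later, when Solr rejects the file.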

Post this json file to the collection:

Post_converted_url = '"https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/%s/solr/%s/update/json/docs?commit=true&split=/answer_units/id&f=id:/answer_units/id&f=title:/answer_units/title&f=body:/answer_units/content/text"' %(solr_cluster_id, solr_collection_name)
cmd ='''curl -X POST -H %s -u %s %s --data-binary @%s''' %(Cont_type_json, solr_credentials, Post_converted_url, Path_jsonFile)
subprocess.call(cmd,shell=True)
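The query string in that URL is what tells Solr's /update/json/docs handler how to turn the converted JSON into documents: `split` names the path at which the input is split into separate documents, and each `f` parameter maps a JSON path to a Solr field. A small sketch of assembling it from a mapping follows; `build_json_docs_url` is a hypothetical helper, and the field names are the ones used in the URL above.

```python
from urllib.parse import urlencode


def build_json_docs_url(cluster_id, collection, field_paths,
                        split="/answer_units/id"):
    """Build the /update/json/docs URL with commit, split and f parameters.

    field_paths maps a Solr field name to the JSON path it is read from.
    """
    base = ("https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/"
            "solr_clusters/%s/solr/%s/update/json/docs" % (cluster_id, collection))
    params = [("commit", "true"), ("split", split)]
    params += [("f", "%s:%s" % (field, path))
               for field, path in field_paths.items()]
    # Keep "/" and ":" unescaped so the result matches the hand-written URL.
    return base + "?" + urlencode(params, safe="/:")


url = build_json_docs_url(
    "CLUSTER-ID", "example_collection",
    {"id": "/answer_units/id",
     "title": "/answer_units/title",
     "body": "/answer_units/content/text"},
)
```

Adding another field then only requires one more entry in the mapping, provided the collection's schema defines that field.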

Send queries:

pysolr_client = retrieve_and_rank.get_pysolr_client(solr_cluster_id, solr_collection_name)
results = pysolr_client.search(Query_term)
print(results.docs)

The code is in Python. For beginners: you can run the curl commands directly in your CMD. I hope it helps.
