How to index crawled "html" from Apache Nutch to Solr?

Question

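I want to index the source code (the raw HTML) of web pages crawled by Apache Nutch (v1.17) into Solr (8.6.3), but I don't know how to do it. So far I only get a processed version of each page, indexed into the Solr content field (see below).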

{
  "tstamp":"2020-11-19T08:41:15.908Z",
  "digest":"fdc7532e799d4a3a434be4be67c36bb3b",
  "boost":1.0,
  .
  .
  .
  "content":"Algorithm Engineering Group ....",
 "_version_":16837969286885539843
}

I have already looked at index-writers.xml, but I still can't figure out how to do it. Maybe someone knows how.

Answer 1

The Nutch index tool provides command-line options for indexing the raw content of web pages:

$> bin/nutch index
...
-addBinaryContent  index raw/binary content in field `binaryContent`
-base64            use Base64 encoding for binary content
...
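
Put together, a full invocation might look like the sketch below. The `crawl/` directory layout is an assumption (adjust it to wherever your crawldb, linkdb and segments live), and it presumes index-writers.xml already points at your Solr core. The Solr schema will probably also need a `binaryContent` field that can hold the encoded page. The dry-run guard is only there so the snippet can be pasted safely outside a Nutch installation.

```shell
# Assumed paths; not taken from the question.
CRAWL_DIR="crawl"
NUTCH="bin/nutch"

# Index all segments and add the raw page bytes, Base64-encoded,
# to the `binaryContent` field.
INDEX_ARGS="index $CRAWL_DIR/crawldb -linkdb $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments -addBinaryContent -base64"

if [ -x "$NUTCH" ]; then
  # Intentionally unquoted so the arguments are split on spaces.
  "$NUTCH" $INDEX_ARGS
else
  echo "dry run: $NUTCH $INDEX_ARGS"
fi
```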

Note: Beware of PDFs and other binary formats the crawler may fetch!
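
With `-base64`, the field holds Base64 text rather than readable HTML, so a consumer has to decode it. A minimal sketch, assuming GNU coreutils `base64` and a made-up sample value standing in for what Solr returns in `binaryContent`:

```shell
# Hypothetical field value: Base64 of a tiny HTML page.
encoded="PGh0bWw+PGJvZHk+SGVsbG88L2JvZHk+PC9odG1sPg=="

# Decode back to the original page source.
html=$(printf '%s' "$encoded" | base64 -d)
echo "$html"
```

For a PDF or another binary document the decoded bytes are not HTML, so it is worth checking the document's content type before treating the result as markup.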

