HTML formatting is lost when indexing data using Nutch with HBase

Question

I am trying to crawl a sample HTML file in a Nutch HBase setup. When I retrieve the NutchDocument (org.apache.nutch.indexer.NutchDocument) to read the content, I get the data as plain text, as shown below:

    tstamp: [1970-01-01T00:00:00.000Z]
    digest: [52e6d9e5e5e96e2cfac7fcd92cd117f8]
    host:   []
    boost:  [1.0]
    id:     [:file/home/file.html]
    title:  [Nutch1]
    url:    [file:///home/file.html]
    content:        [Nutch1 Nutch1 The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11, we advise all current users and developers of the 1.X series to upgrade to this release. Nutch-Nutch-Identifies the overall Positive]

But I was expecting the raw HTML content, not the extracted text.

Is there a setting I am missing?

Thanks

Answer 1

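Take a look at the index-html plugin on the 2.x branch.

This plugin lets you index the raw HTML content of a document. By default, Nutch parses the page and indexes only the extracted text; all HTML tags are ignored.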
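
As a rough sketch of how the plugin might be switched on, indexing plugins in Nutch are normally enabled through the plugin.includes property in conf/nutch-site.xml. The value below is only illustrative and is an assumption, not taken from the question: keep whatever entries your existing plugin.includes already has (protocol, parser, URL filters and so on) and just add html to the index-(...) group.

    <!-- conf/nutch-site.xml: illustrative fragment only. -->
    <!-- The surrounding plugin list is an assumed example; merge with your own value. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|html)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    </property>

After changing the plugin list, the pages most likely need to be re-indexed for the raw HTML to appear, and the indexing backend (for example the Solr schema) has to accept the extra field that the plugin adds. Check the plugin source on the 2.x branch for the exact field name.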

Tags: java, solr, hbase, nutch