Nutch Crawler doesn't retrieve news article content
I am trying to crawl news articles from a link:
However, the text from the page is not extracted into the content field of the index (Elasticsearch).
The crawl result is:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.09492774,
    "hits": [
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
        "_score": 0.09492774,
        "_source": {
          "tstamp": "2016-08-04T07:21:59.614Z",
          "segment": "20160804125156",
          "digest": "d583a81c0c4c7510f5c842ea3b557992",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "url": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "content": ""
        }
      },
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
        "_score": 0.009845509,
        "_source": {
          "tstamp": "2016-08-04T07:22:05.708Z",
          "segment": "20160804125156",
          "digest": "2a94a32ffffd0e03647928755e055e30",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "url": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "content": ""
        }
      }
    ]
  }
}
Notice that the content field is empty. I have tried different options in nutch-site.xml, but the result stays the same. Please help me resolve this issue.
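For reference, two nutch-site.xml properties are commonly checked when parsed content comes back empty: the fetch-size limit (a too-small limit truncates pages) and the plugin list (an HTML parser must be enabled). This is an illustrative fragment, not a confirmed fix; the values shown are examples:

```xml
<!-- Illustrative nutch-site.xml fragment (example values, not a verified fix). -->
<property>
  <name>http.content.limit</name>
  <!-- Max bytes of each fetched page to keep; larger pages are truncated. -->
  <value>1048576</value>
</property>
<property>
  <name>plugin.includes</name>
  <!-- Make sure an HTML parser (parse-html or parse-tika) is in the list. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```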
This is a somewhat tangential answer, but try Apache ManifoldCF. It provides a built-in connector for Elasticsearch, along with better history reporting for finding out why data is not being indexed. The connector configuration in ManifoldCF also lets you specify which field your content should be indexed into. It is a decent open-source alternative that is worth trying hands-on.
I don't know why Nutch fails to extract the article content, but I found a workaround using Jsoup. I developed a custom parse filter plugin that parses the whole document and sets the parse text in the ParseResult returned by the parse filter, and I switched to my custom parse filter by replacing the parse-html plugin in parse-plugins.xml.
It looks like this:
// Re-parse the raw fetched bytes with Jsoup, using the page URL as the base URI
org.jsoup.nodes.Document document = Jsoup.parse(new String(content.getContent(), "UTF-8"), content.getUrl());
// Reuse the status, outlinks and metadata produced by the default parser
Parse parse = parseResult.get(content.getUrl());
ParseStatus status = parse.getData().getStatus();
String title = document.title();
ParseData parseData = new ParseData(status, title, parse.getData().getOutlinks(), parse.getData().getContentMeta(), parse.getData().getParseMeta());
// Replace the (empty) parse text with the body text extracted by Jsoup
parseResult.put(content.getUrl(), new ParseText(document.body().text()), parseData);
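The parse-plugins.xml change mentioned above could look roughly like the fragment below. The plugin id parse-jsoup and the extension class name are hypothetical placeholders for whatever the custom plugin is actually registered as:

```xml
<!-- parse-plugins.xml sketch: route text/html to the custom plugin instead
     of parse-html. "parse-jsoup" and the extension-id are illustrative. -->
<mimeType name="text/html">
  <plugin id="parse-jsoup" />
</mimeType>

<aliases>
  <alias name="parse-jsoup"
         extension-id="org.example.parse.jsoup.JsoupParser" />
</aliases>
```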