How to crawl specific data from a website using stormcrawler

Question

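I am using stormcrawler (v1.16) to crawl news websites and store the data in Elasticsearch (v7.5.0). My crawler-conf file is stormcrawler files. I am using Kibana for visualization. My questions are:

When crawling a news website I only want the URLs of article content, but I am also getting the URLs of ads and other tabs on the website. What do I have to change, and where? Kibana link
If I only need specific things from a URL (for example only the title or only the content), how can I do that?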

Edit: I wanted to add a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh and crawler-conf.yaml. The XPath I added is correct. I added

"parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content"

to the parse filters,

parse.pubDate =PublishDate

to crawler-conf, and

"PublishDate": { "type": "text", "index": false, "store": true}

to the properties in ES_IndexInit.sh. However, I still do not get any field named PublishDate in Kibana or Elasticsearch. The ES_IndexInit.sh mapping is as follows:

{
  "mapping": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "PublishDate": {
        "type": "text",
        "index": false,
        "store": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "store": true
      },
      "domain": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "host": {
        "type": "keyword",
        "store": true
      },
      "keywords": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "store": true
      },
      "url": {
        "type": "keyword",
        "store": true
      }
    }
  }
}
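
One way to rule out the XPath itself as the cause is to evaluate it outside the crawler. The following is a minimal sketch using only the JDK's XPath API, not StormCrawler code. The class name, method name and sample markup are hypothetical, and the sample is well-formed XML with upper-case element names, which is an assumption made so that it matches the upper-case META in the expression (XPath is case sensitive).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of StormCrawler.
final class PubDateXPathCheck {

    // The expression from parsefilter.json, unchanged.
    static final String PUB_DATE_XPATH =
            "//META[@itemprop=\"datePublished\"]/@content";

    // Returns the matched attribute value, or "" when nothing matches.
    static String extractPubDate(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(PUB_DATE_XPATH, doc);
        } catch (Exception e) {
            // Parsing or XPath failure: surface it as an unchecked error.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page, for illustration only.
        String sample = "<HTML><HEAD>"
                + "<META itemprop=\"datePublished\" content=\"2020-01-15\"/>"
                + "</HEAD><BODY/></HTML>";
        System.out.println(extractPubDate(sample)); // prints 2020-01-15
    }
}
```

If the expression returns a value here but the field is still missing from the index, the problem is more likely in the metadata mapping or the index definition than in the XPath.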

Answer 1

One way to index only the news pages of a site is to rely on sitemaps, but not all sites provide them.

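Alternatively, you would need a mechanism as part of the parsing, possibly in a ParseFilter, that determines whether a page is a news item, and then filter during indexing based on the presence of a key/value in the metadata.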
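
The idea can be sketched as two steps: tag the page at parse time, then test for that tag before indexing. The code below is a plain-Java illustration of the logic only and does not use StormCrawler's classes. The key `isNews`, the class and method names, and the heuristic (a page counts as news if a `parse.pubDate` value was extracted) are all assumptions made for the example. In a real topology the first step would live in a custom ParseFilter, and the second would be handled by the indexer's metadata filter setting (I believe this is `indexer.md.filter`, but check the configuration documentation for your version).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the tagging and filtering logic.
final class NewsMetadataFilter {

    static final String NEWS_KEY = "isNews";   // assumed key name
    static final String NEWS_VALUE = "true";   // assumed value

    // Parse step: tag the page when a publication date was extracted.
    static void markIfNews(Map<String, List<String>> metadata) {
        List<String> pubDate = metadata.get("parse.pubDate");
        if (pubDate != null && !pubDate.isEmpty()) {
            metadata.put(NEWS_KEY, Collections.singletonList(NEWS_VALUE));
        }
    }

    // Index step: keep only pages carrying the key/value pair.
    static boolean shouldIndex(Map<String, List<String>> metadata) {
        List<String> values = metadata.get(NEWS_KEY);
        return values != null && values.contains(NEWS_VALUE);
    }

    public static void main(String[] args) {
        Map<String, List<String>> article = new HashMap<>();
        article.put("parse.pubDate", Collections.singletonList("2020-01-15"));
        markIfNews(article);

        Map<String, List<String>> adPage = new HashMap<>();
        markIfNews(adPage);

        System.out.println(shouldIndex(article)); // prints true
        System.out.println(shouldIndex(adPage));  // prints false
    }
}
```

A publication date is only one possible signal. Any value a ParseFilter can extract reliably from article pages and not from ad or section pages would serve the same purpose.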

In the news crawl dataset from CommonCrawl, this is handled by using sitemaps or RSS feeds as the seed URLs.

To not index the content, simply comment out

  indexer.text.fieldname: "content"

in the configuration.


web-crawler

data-extraction

apache-storm

stormcrawler