Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of each page's text, it isn't capturing large amounts of other text on the pages.

I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks offered here (thank you a million!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part I'm using the default configuration, but have made a few changes, which are reflected in the configuration files included below.

After re-creating fresh ES indices, running mvn clean package, and then launching the crawler topology, Stormcrawler starts doing its thing and content begins appearing in Elasticsearch. For many pages, however, the content that is retrieved and indexed is only a subset of all the text on the page, and it usually excludes the main page text we are interested in.

For example, the text at the following XML path is not returned/indexed:

<html> <body> <div#maincontentcontainer.container> <div#docs-container> <div> <div.row> <div.col-lg-9.col-md-8.col-sm-12.content-item> <div> <div> <p> (text)

while the text at this path is returned:

<html> <body> <div> <div.container> <div.row> <p> (text)

Is any other configuration change needed, beyond commenting out all of the tag-specific include and exclude patterns? From my reading of the documentation, the default for those options is to force the whole page to be indexed.
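For what it's worth, if enabling the include patterns turns out to be the right approach, my guess (based on the commented-out examples in the default configuration and the div id in the path above) is that it would look something like the following, although at the moment I have these commented out, as shown in my crawler-conf.yaml below:

  # guess at an include pattern targeting the container from the path above (NOT currently enabled)
  textextractor.include.pattern:
   - DIV[id="maincontentcontainer"]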

Any help would be greatly appreciated. Thanks for the excellent software.

Here are my configuration files:

crawler-conf.yaml

config:
  topology.workers: 3
  topology.message.timeout.secs: 1000
  topology.max.spout.pending: 100
  topology.debug: false

  fetcher.threads.number: 100

  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # metadata to transfer to the outlinks
  # metadata.transfer:
  # - customMetadataName

  # lists the metadata to persist to storage
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed
   
  http.agent.name: "My crawler"
  http.agent.version: "1.0"
  http.agent.description: ""
  http.agent.url: ""
  http.agent.email: ""

  # The maximum number of bytes for returned HTTP response bodies.
  http.content.limit: -1

  # FetcherBolt queue dump => comment out to activate
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1

  # text extraction for JSoupParserBolt
  # textextractor.include.pattern:
  #  - DIV[id="maincontent"]
  #  - DIV[itemprop="articleBody"]
  #  - ARTICLE

  # textextractor.exclude.tags:
  #  - STYLE
  #  - SCRIPT

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
     - class: "org.apache.storm.metric.LoggingMetricsConsumer"
       parallelism.hint: 1

  # fetch pages with Selenium's RemoteWebDriver (chromedriver) instead of the default HTTP protocol
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  selenium.addresses: "http://localhost:9515"

es-conf.yaml

config:
  # ES indexer bolt
  es.indexer.addresses: "localhost"
  es.indexer.index.name: "content"
  # es.indexer.pipeline: "_PIPELINE_"
  es.indexer.create: false
  es.indexer.bulkActions: 100
  es.indexer.flushInterval: "2s"
  es.indexer.concurrentRequests: 1
  
  # ES metricsConsumer
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"
  
  # ES spout and persistence bolt
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.status.routing: true
  es.status.routing.fieldname: "key"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1
  
  # spout config #
  # positive or negative filters parsable by the Lucene Query Parser
  # es.status.filterQuery: 
  #  - "-(key:stormcrawler.net)"
  #  - "-(key:digitalpebble.com)"

  # time in secs for which the URLs will be considered for fetching after an ack or a fail
  spout.ttl.purgatory: 30
  
  # Min time (in msecs) to allow between 2 successive queries to ES
  spout.min.delay.queries: 2000

  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
  spout.reset.fetchdate.after: 120

  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "key"
  # fields to sort the URLs within a bucket
  es.status.bucket.sort.field: 
   - "nextFetchDate"
   - "url"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query 
  es.status.max.start.offset: 500
  
  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false

  # max allowed duration of a query in sec 
  es.status.query.timeout: -1

  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
         parallelism.hint: 1
         #whitelist:
         #  - "fetcher_counter"
         #  - "fetcher_average.bytes_fetched"
         #blacklist:
         #  - "__receive.*"

es-crawler.flux

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 3
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 3
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 3
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 3
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 12
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 3
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 3
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 3

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
      
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE     

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

parsefilters.json

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
            "//*[@name=\"description\"]/@content",
            "//*[@name=\"Description\"]/@content"
         ],
        "parse.title": [
            "//TITLE",
            "//META[@name=\"title\"]/@content"
         ],
         "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
         "pattern": "//FRAME/@src"
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
       }
    }
  ]
}

Attempting to use Chromedriver

I installed the latest versions of Chromedriver and Google Chrome for Ubuntu.

First, I start chromedriver in headless mode on localhost:9515 as the stormcrawler user (via a separate Python shell, as shown below), and then I restart the stormcrawler topology (also as the stormcrawler user), but I end up with a pile of Chrome-related errors. The strange thing is that I can confirm directly in the Python shell that chromedriver is running OK, and I can confirm (via ps -ef) that both the driver and the browser are actively running. The same error stack also occurs when I simply start chromedriver from the command line (i.e. chromedriver --headless &).

Starting chromedriver in headless mode (in a python3 shell):

from selenium import webdriver
# Chrome flags for running headless as the stormcrawler user
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--window-size=1200x600')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--user-data-dir=/home/stormcrawler/cache/google/chrome')
options.add_argument('--disable-gpu')
options.add_argument('--profile-directory=Default')
options.binary_location = '/usr/bin/google-chrome'
# start chromedriver on port 9515 (the address selenium.addresses points to) and launch headless Chrome
driver = webdriver.Chrome(chrome_options=options, port=9515, executable_path=r'/usr/bin/chromedriver')

Stack trace from launching the stormcrawler topology with the command: storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 60000

9486 [Thread-26-fetcher-executor[3 3]] ERROR o.a.s.util - Async loop died!
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: '4.0.0-alpha-6', revision: '5f43a29cfc'
System info: host: 'stormcrawler-dev', ip: '127.0.0.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-33-generic', java.version: '1.8.0_282'
Driver info: driver.version: RemoteWebDriver
remote stacktrace: #0 0x55d590b21e89 <unknown>

    at com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol.configure(RemoteDriverProtocol.java:101) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    at com.digitalpebble.stormcrawler.protocol.ProtocolFactory.<init>(ProtocolFactory.java:69) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    at com.digitalpebble.stormcrawler.bolt.FetcherBolt.prepare(FetcherBolt.java:818) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
    at org.apache.storm.daemon.executor$fn__10180$fn__10193.invoke(executor.clj:803) ~[storm-core-1.2.3.jar:1.2.3]
    at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.2.3.jar:1.2.3]
    at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
...

Confirming that chromedriver and chrome are both running and reachable:

~/stormcrawler$ ps -ef | grep -i 'driver'
stormcr+ 18862 18857  0 14:28 pts/0    00:00:00 /usr/bin/chromedriver --port=9515
stormcr+ 18868 18862  0 14:28 pts/0    00:00:00 /usr/bin/google-chrome --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-gpu --disable-hang-monitor --disable-infobars --disable-popup-blocking --disable-prompt-on-repost --disable-setuid-sandbox --disable-sync --enable-automation --enable-blink-features=ShadowDOMV0 --enable-logging --headless --log-level=0 --no-first-run --no-sandbox --no-service-autorun --password-store=basic --profile-directory=Default --remote-debugging-port=9222 --test-type=webdriver --use-mock-keychain --user-data-dir=/home/stormcrawler/cache/google/chrome --window-size=1200x600
stormcr+ 18899 18877  0 14:28 pts/0    00:00:00 /opt/google/chrome/chrome --type=renderer --no-sandbox --disable-dev-shm-usage --enable-automation --enable-logging --log-level=0 --remote-debugging-port=9222 --test-type=webdriver --allow-pre-commit-input --ozone-platform=headless --field-trial-handle=17069524199442920904,10206176048672570859,131072 --disable-gpu-compositing --enable-blink-features=ShadowDOMV0 --lang=en-US --headless --enable-crash-reporter --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --shared-files=v8_context_snapshot_data:100

~/stormcrawler$ sudo netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 localhost:9222          0.0.0.0:*               LISTEN      18026/google-chrome 
tcp        0      0 localhost:9515          0.0.0.0:*               LISTEN      18020/chromedriver  

IIRC you need to set some additional config to use it with ChromeDriver.
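From memory (untested, and the key names may differ in your version, so check the Selenium page of the StormCrawler wiki), the idea is to pass the Chrome flags as capabilities in crawler-conf.yaml so that ChromeDriver launches Chrome itself with them, rather than relying on the session you pre-start from the Python shell; the "DevToolsActivePort file doesn't exist" message is typically Chrome crashing at startup because those flags are missing. Something along these lines:

  selenium.addresses: "http://localhost:9515"
  # untested sketch: pass the same flags used in the Python shell through the driver capabilities
  selenium.capabilities:
    "goog:chromeOptions":
      args:
        - "--headless"
        - "--no-sandbox"
        - "--disable-dev-shm-usage"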

Alternatively (I haven't tried it yet), https://hub.docker.com/r/browserless/chrome is a nice way of handling Chrome in a Docker container.
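If you go that route, I believe you would just point the crawler at the container's WebDriver endpoint instead of a local chromedriver, something like this in crawler-conf.yaml (from memory of the browserless docs, so double-check the port and path):

  # browserless/chrome exposes a Selenium-compatible endpoint (default port 3000)
  selenium.addresses: "http://localhost:3000/webdriver"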