How do you set up Stormcrawler to run with chromedriver instead of phantomJS?

The tutorial here describes how to set up Stormcrawler to run with phantomJS, but phantomJS seems unable to fetch and execute externally linked JavaScript pages (e.g., JavaScript that links to code outside the immediate page context). Chromedriver, however, appears able to handle this case. How do you set up Stormcrawler to run with chromedriver instead of phantomJS?

The basic steps you need to follow are:

  1. Install the latest versions of Chrome and Chromedriver (the following is based on the tutorial here):
    # Install Google Chrome
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
    sudo apt install ./google-chrome-stable_current_amd64.deb
    
    # Install Chromedriver
    PLATFORM=linux64  # Adjust as necessary depending on your system
    VERSION=$(curl http://chromedriver.storage.googleapis.com/LATEST_RELEASE)
    curl -O http://chromedriver.storage.googleapis.com/$VERSION/chromedriver_$PLATFORM.zip
    unzip chromedriver_$PLATFORM.zip
    
    # Move the executable into your PATH (may require root)
    sudo cp chromedriver /usr/bin/
    
  2. Specify the following selenium settings in your crawler configuration file (based on a snippet from @JulienNioche here), including the address and port on which chromedriver runs:
    http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
    https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
    selenium.addresses: "http://localhost:9515"
    selenium.setScriptTimeout: 10000
    selenium.pageLoadTimeout: 1000
    selenium.implicitlyWait: 1000
    selenium.capabilities:
      goog:chromeOptions:
        args:
        - "--no-sandbox"
        - "--disable-dev-shm-usage"
        - "--headless"
        - "--disable-gpu"
    
  3. Rebuild your Stormcrawler maven package: mvn clean install; mvn clean package
    • Only necessary if you have modified any of the source or configuration files, but rebuilding doesn't hurt
  4. Start chromedriver in the background (default port 9515): chromedriver --headless & (a quick way to verify it is responding is shown just after this list)
  5. [Optional, if connecting to Elasticsearch] Set up your ES indices, if you haven't already
  6. Start your topology (test your setup in local mode first; if it doesn't crash, you should be good to move on to remote mode):
    storm jar target/stormcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --local es-crawler.flux --sleep 600000
    
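Once chromedriver is started (step 4), you can confirm it is up and reachable before launching the topology. This is a minimal check using the standard WebDriver /status endpoint, assuming chromedriver is listening on its default port 9515:

    # Expect a JSON response whose "value" includes "ready": true
    curl -s http://localhost:9515/status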

If things still aren't working after following these steps, there may be a problem with one of your configuration files, or a version incompatibility between one or more of the tools. In any case, I've included below a set of example configurations that worked for me (as of the time of writing), in the hope that they may help get things running.


Example configurations (stormcrawler-elasticsearch setup with chromedriver)

Versions used at the time of writing this answer:

  • Stormcrawler 1.17
  • Storm 1.2.3
  • Selenium 4.0.0-alpha-6 (no separate installation needed; it is downloaded and installed during the maven build of stormcrawler 1.17)
  • Chromedriver 90.0.4430.24
  • Google Chrome 90.0.4430.93
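
Note that Chrome and Chromedriver must share the same major version (90 here); mismatched versions are a common source of failures. To check what you have installed:

    google-chrome --version    # e.g. Google Chrome 90.0.4430.93
    chromedriver --version     # e.g. ChromeDriver 90.0.4430.24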

crawler-conf.yaml

config:
  topology.workers: 3
  topology.message.timeout.secs: 3000
  topology.max.spout.pending: 100
  topology.debug: true

  fetcher.threads.number: 100

  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"

  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
   - _redirTo
   - error.cause
   - error.source
   - isSitemap
   - isFeed

  http.agent.name: "Anonymous Coward"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler 1.17"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "someone@someorganization.com"

  # The maximum number of bytes for returned HTTP response bodies.
  # The default of 65536 would trim fetched pages to 64KB;
  # set -1 (as done here) to disable the limit.
  http.content.limit: -1 # default 65536

  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"

  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440

  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120

  # never revisit a page with an error (or set a value in minutes)
  fetchInterval.error: -1

  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - domain=domain

  # Metrics consumers:
  topology.metrics.consumer.register:
     - class: "org.apache.storm.metric.LoggingMetricsConsumer"
       parallelism.hint: 1

  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
  selenium.addresses: "http://localhost:9515"
  selenium.setScriptTimeout: 10000
  selenium.pageLoadTimeout: 1000
  selenium.implicitlyWait: 1000
  selenium.capabilities:
    goog:chromeOptions:
      args:
      - "--nosandbox"
      - "--disable-dev-shm-usage"
      - "--headless"
      - "--disable-gpu"

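Before wiring this into Storm, you can verify that chromedriver accepts these capabilities by opening a session over the raw WebDriver protocol. This is a minimal sketch (assuming chromedriver on localhost:9515); a response containing a sessionId means a headless browser was started successfully:

    curl -s -X POST http://localhost:9515/session \
      -H 'Content-Type: application/json' \
      -d '{"capabilities": {"alwaysMatch": {"goog:chromeOptions": {
            "args": ["--no-sandbox", "--disable-dev-shm-usage", "--headless", "--disable-gpu"]}}}}'

    # Clean up afterwards (the sessionId comes from the response above):
    # curl -X DELETE http://localhost:9515/session/<sessionId>
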
es-conf.yaml

config:
  # ES indexer bolt
  es.indexer.addresses: "localhost"
  es.indexer.index.name: "content"
  # es.indexer.pipeline: "_PIPELINE_"
  es.indexer.create: false
  es.indexer.bulkActions: 100
  es.indexer.flushInterval: "2s"
  es.indexer.concurrentRequests: 1

  # ES metricsConsumer
  es.metrics.addresses: "http://localhost:9200"
  es.metrics.index.name: "metrics"

  # ES spout and persistence bolt
  es.status.addresses: "http://localhost:9200"
  es.status.index.name: "status"
  es.status.routing: true
  es.status.routing.fieldname: "key"
  es.status.bulkActions: 500
  es.status.flushInterval: "5s"
  es.status.concurrentRequests: 1

  # spout config #

  # time in secs for which the URLs will be considered for fetching after an ack or a fail
  spout.ttl.purgatory: 30

  # Min time (in msecs) to allow between 2 successive queries to ES
  spout.min.delay.queries: 2000

  # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
  spout.reset.fetchdate.after: 120

  es.status.max.buckets: 50
  es.status.max.urls.per.bucket: 2
  # field to group the URLs into buckets
  es.status.bucket.field: "key"
  # fields to sort the URLs within a bucket
  es.status.bucket.sort.field:
   - "nextFetchDate"
   - "url"
  # field to sort the buckets
  es.status.global.sort.field: "nextFetchDate"

  # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
  es.status.max.start.offset: 500

  # AggregationSpout : sampling improves the performance on large crawls
  es.status.sample: false

  # max allowed duration of a query in sec
  es.status.query.timeout: -1

  # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
  # use it as nextFetchDate
  es.status.recentDate.increase: -1
  es.status.recentDate.min.gap: -1

  topology.metrics.consumer.register:
       - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
         parallelism.hint: 1
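
For step 5 above, the "content", "status", and "metrics" indices referenced here need to exist in Elasticsearch before the topology starts; the stormcrawler-elasticsearch module provides an ES_IndexInit.sh script for this (use the copy matching your release). A quick sanity check, assuming ES on localhost:9200:

    # Confirm Elasticsearch is reachable
    curl -s http://localhost:9200
    # List indices; content, status, and metrics should appear once initialized
    curl -s 'http://localhost:9200/_cat/indices?v'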

es-crawler.flux

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - true

bolts:
  - id: "filter"
    className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
    parallelism: 3
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 3
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 3
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 3
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 12
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 3
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 3
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 3

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filespout"
    to: "filter"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "filter"
    to: "status"
    grouping:
      streamId: "status"
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byDomain"

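The FileSpout defined above reads seed URLs from seeds.txt (one URL per line) in the directory the topology is launched from. A minimal example, with placeholder URLs:

    https://www.example.com/
    https://www.example.org/
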
parsefilters.json

{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
            "//*[@name=\"description\"]/@content",
            "//*[@name=\"Description\"]/@content"
         ],
        "parse.title": [
            "//TITLE",
            "//META[@name=\"title\"]/@content"
         ],
         "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
         "pattern": "//FRAME/@src"
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
       }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
       }
    }
  ]
}
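
urlfilters.json

crawler-conf.yaml also points at urlfilters.json. The sketch below is a trimmed-down version of the defaults generated by the Stormcrawler archetype; treat it as a starting point and verify the class names and parameters against your version:

{
  "com.digitalpebble.stormcrawler.filtering.URLFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
      "name": "BasicURLNormalizer",
      "params": {
        "removeAnchorPart": true,
        "unmangleQueryString": true,
        "checkValidURI": true
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": false,
        "ignoreOutsideDomain": true
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter",
      "name": "RegexURLFilter",
      "params": {
        "regexFilterFile": "default-regex-filters.txt"
      }
    }
  ]
}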

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>org.rcsb.crawler</groupId>
    <artifactId>stormcrawler</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>stormcrawler</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <stormcrawler.version>1.17</stormcrawler.version>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.3.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                    <includeProjectDependencies>true</includeProjectDependencies>
                    <includePluginDependencies>false</includePluginDependencies>
                    <classpathScope>compile</classpathScope>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.3.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <transformers>
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                                <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>org.apache.storm.flux.Flux</mainClass>
                                    <manifestEntries>
                                        <Change></Change>
                                        <Build-Date></Build-Date>
                                    </manifestEntries>
                                </transformer>
                            </transformers>
                            <!-- The filters below are necessary if you want to include the Tika
                                module -->
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                                <filter>
                                    <!-- https://issues.apache.org/jira/browse/STORM-2428 -->
                                    <artifact>org.apache.storm:flux-core</artifact>
                                    <excludes>
                                        <exclude>org/apache/commons/**</exclude>
                                        <exclude>org/apache/http/**</exclude>
                                        <exclude>org/yaml/**</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-core</artifactId>
            <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-elasticsearch</artifactId>
            <version>${stormcrawler.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.2.3</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>flux-core</artifactId>
            <version>1.2.3</version>
        </dependency>
    </dependencies>
</project>