StormCrawler with SQL external module gets ParseFilters exception at crawl stage

I am using StormCrawler with the SQL external module. I have updated pom.xml with:

<dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-sql</artifactId>
        <version>1.8</version>
</dependency>

I use an injector/crawl procedure similar to the one for the ES setup:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --local sql-injector.flux --sleep 864000

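I have created the MySQL database crawl and the table urls, and successfully injected my URLs into it. For example, if I run select * from crawl.urls limit 5; I can see the url, status and other fields. From this I conclude that, at this stage, the crawler connects to the database.

The SQL injector (sql-injector.flux) looks like this: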

name: "injector"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

components:
- id: "scheme"
  className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
  constructorArgs:
    - DISCOVERED

spouts:
- id: "spout"
  className: "com.digitalpebble.stormcrawler.spout.FileSpout"
  parallelism: 1
  constructorArgs:
    - "seeds.txt"
    - ref: "scheme"

bolts:
- id: "status"
  className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
  parallelism: 1

streams:
- from: "spout"
  to: "status"
  grouping:
    type: CUSTOM
    customClass:
      className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
      constructorArgs:
        - "byHost"

When I run:

storm jar target/stromcrawler-1.0-SNAPSHOT.jar  org.apache.storm.flux.Flux --remote sql-crawler.flux

I get the following exception at the parse bolt:

java.lang.RuntimeException: Exception caught while loading the ParseFilters from parsefilters.json
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:67)
    at com.digitalpebble.stormcrawler.bolt.JSoupParserBolt.prepare(JSoupParserBolt.java:116)
    at org.apache.storm.daemon.executor$fn__5043$fn__5056.invoke(executor.clj:803)
    at org.apache.storm.util$async_loop$fn__557.invoke(util.clj:482)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Unable to build JSON object from file
    at com.digitalpebble.stormcrawler.parse.ParseFilters.<init>(ParseFilters.java:92)
    at com.digitalpebble.stormcrawler.parse.ParseFilters.fromConf(ParseFilters.java:62)
    ... 5 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('}' (code 125)): was expecting double-quote to start field name...

sql-crawler.flux:

name: "crawler"

includes:
- resource: true
  file: "/crawler-default.yaml"
  override: false

- resource: false
  file: "crawler-conf.yaml"
  override: true

- resource: false
  file: "sql-conf.yaml"
  override: true

- resource: false
  file: "my-config.yaml"
  override: true

spouts:
- id: "spout"
  className: "com.digitalpebble.stormcrawler.sql.SQLSpout"
  parallelism: 100

bolts:
- id: "partitioner"
  className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
  parallelism: 1
- id: "fetcher"
  className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
  parallelism: 1
- id: "sitemap"
  className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
  parallelism: 1
- id: "parse"
  className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
  parallelism: 1
- id: "status"
  className: "com.digitalpebble.stormcrawler.sql.StatusUpdaterBolt"
  parallelism: 1

streams:
- from: "spout"
  to: "partitioner"
  grouping:
    type: SHUFFLE

- from: "partitioner"
  to: "fetcher"
  grouping:
    type: FIELDS
    args: ["key"]

- from: "fetcher"
  to: "sitemap"
  grouping:
    type: LOCAL_OR_SHUFFLE

- from: "sitemap"
  to: "parse"
  grouping:
    type: LOCAL_OR_SHUFFLE

- from: "fetcher"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

- from: "sitemap"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

- from: "parse"
  to: "status"
  grouping:
    type: FIELDS
    args: ["url"]
    streamId: "status"

It looks like the StringUtils object at ParseFilters.java:60 is blank.

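Check the content of src/main/resources/parsefilters.json (or whatever value you have set for parsefilters.config.file). Judging from the error message, the JSON it contains is invalid. Don't forget to rebuild the uber jar with mvn clean package afterwards.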
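
If it helps to locate the problem before rebuilding, here is a minimal sketch in Python that parses the file and reports the line and column of the first syntax error. The function names are my own, and Python's json module is only a stand-in for the Jackson parser StormCrawler uses, so the two may disagree on edge cases. A Jackson message of the form "Unexpected character ('}') ... was expecting double-quote to start field name" is commonly caused by a trailing comma before a closing brace, which this check also rejects.

```python
import json
from pathlib import Path


def find_json_error(text):
    """Return None if text is valid JSON, else a 'line L, column C: message' string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def check_json_file(path):
    """Read a file (hypothetical helper) and run find_json_error on its content."""
    return find_json_error(Path(path).read_text(encoding="utf-8"))


# Illustrative input only, not the real parsefilters.json:
# the comma before the closing brace makes the object invalid.
broken = '{\n  "name": "example",\n}'
print(find_json_error(broken))
```

To check the real file, call check_json_file("src/main/resources/parsefilters.json") from the project root; a return value of None means the file parsed cleanly.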