How to filter StormCrawler data from Elasticsearch
I am using apache-storm 1.2.3 and elasticsearch 7.5.0. I have successfully extracted data from 3k news websites and visualized it in Grafana and Kibana. I am getting a lot of junk (like advertisements) in the content field; I have attached a screenshot of the content. Can anyone suggest how I can filter it out? I am thinking of feeding the HTML content from ES to some Python package, along the lines of the rough sketch below. Am I on the right track? If not, please suggest a good solution.
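A minimal sketch of that idea, assuming ES runs on localhost, the raw HTML is stored in an index and field both named "content", and using the readability-lxml package (all of these are my assumptions, not part of the setup shown below):

# pip install elasticsearch readability-lxml lxml
from elasticsearch import Elasticsearch, helpers
from readability import Document
import lxml.html

es = Elasticsearch(["http://localhost:9200"])

def cleaned_docs(index="content", field="content"):
    # helpers.scan pages through every document in the index
    for hit in helpers.scan(es, index=index, query={"query": {"match_all": {}}}):
        html = hit["_source"].get(field)
        if not html:
            continue
        main_html = Document(html).summary()  # keeps only the main-content HTML
        text = lxml.html.fromstring(main_html).text_content()
        yield hit["_id"], text.strip()

for doc_id, text in cleaned_docs():
    print(doc_id, text[:200])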
Thanks in advance.
Here is the crawler-conf.yaml file:
config:
  topology.workers: 1
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.threads.number: 50
  # override the JVM parameters for the workers
  topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
  # mandatory when using Flux
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata
  # metadata to transfer to the outlinks
  # used by Fetcher for redirections, sitemapparser, etc...
  # these are also persisted for the parent document (see below)
  # metadata.transfer:
  # - customMetadataName
  # lists the metadata to persist to storage
  # these are not transferred to the outlinks
  metadata.persist:
    - _redirTo
    - error.source
    - isSitemap
    - isFeed
  http.agent.name: "Nitesh Singh"
  http.agent.version: "1.0"
  http.agent.description: "built with StormCrawler Elasticsearch Archetype 1.16"
  http.agent.url: "http://someorganization.com/"
  http.agent.email: "nite0sh@gmail.com"
  # The maximum number of bytes for returned HTTP response bodies.
  # The fetched page will be trimmed to 65KB in this case
  # Set -1 to disable the limit.
  http.content.limit: 65536
  # FetcherBolt queue dump => uncomment to activate
  # if a file exists on the worker machine with the corresponding port number
  # the FetcherBolt will log the content of its internal queues to the logs
  # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"
  # revisit a page daily (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.default: 1440
  # revisit a page with a fetch error after 2 hours (value in minutes)
  # set it to -1 to never refetch a page
  fetchInterval.fetch.error: 120
  fetchInterval.error: -1
  # text extraction for JSoupParserBolt
  textextractor.include.pattern:
    - DIV[id="maincontent"]
    - DIV[itemprop="articleBody"]
    - ARTICLE
  textextractor.exclude.tags:
    - STYLE
    - SCRIPT
  # custom fetch interval to be used when a document has the key/value in its metadata
  # and has been fetched successfully (value in minutes)
  # fetchInterval.FETCH_ERROR.isFeed=true: 30
  # fetchInterval.isFeed=true: 10
  # configuration for the classes extending AbstractIndexerBolt
  # indexer.md.filter: "someKey=aValue"
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  indexer.canonical.name: "canonical"
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
    - parse.description=description
    - domain=domain
  # Metrics consumers:
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
Have you configured the text extractor? For example:
# text extraction for JSoupParserBolt
textextractor.include.pattern:
  - DIV[id="maincontent"]
  - DIV[itemprop="articleBody"]
  - ARTICLE
textextractor.exclude.tags:
  - STYLE
  - SCRIPT
This restricts the extracted text to the specified elements (if they are found) and/or removes the elements listed in the exclusions.
Most news sites use some form of markup to flag the main content.
For the elements you showed in your example, you can add corresponding patterns: if a site wraps its articles in <div class="article-body">, for instance, adding DIV[class="article-body"] to the include patterns would capture it.
You can embed various boilerplate-removal libraries in a ParseFilter, but their accuracy varies greatly.
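If you go down that route (or post-process in Python as you were planning), it is worth comparing a couple of libraries on a sample of your own pages before committing to one. A quick sketch, assuming trafilatura and readability-lxml are installed; the URL is just a placeholder:

# pip install requests trafilatura readability-lxml lxml
import requests
import trafilatura
from readability import Document
import lxml.html

html = requests.get("https://example.com/some-article").text

# trafilatura returns the main text directly (or None if it finds nothing)
print("trafilatura :", (trafilatura.extract(html) or "")[:300])

# readability-lxml returns the main content as HTML; strip the tags to get text
main_html = Document(html).summary()
print("readability :", lxml.html.fromstring(main_html).text_content()[:300])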