Crawl data from website into HDFS
I want to crawl data from a website, so I am using the API from openweather.org.
The agent I configured to stream the data is as follows:
weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
weather.sinks.hdfs-write.type = logger
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind = api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a
weather.sources.Weather.username= abc
weather.sources.Weather.password= ********
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = http
weather.sources.Weather.port = 11111
I run the Flume agent with the following command:
flume-ng agent -f weather.conf -n weather
and I get the following error:
15/03/23 05:17:34 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:weather.conf
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Added sinks: hdfs-write Agent: weather
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [weather]
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Creating channels
15/03/23 05:17:34 INFO channel.DefaultChannelFactory: Creating instance of channel memory-channel type memory
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Created channel memory-channel
15/03/23 05:17:34 INFO source.DefaultSourceFactory: Creating instance of sourceWeather, type http
15/03/23 05:17:35 INFO sink.DefaultSinkFactory: Creating instance of sink: hdfs-write, type: logger
15/03/23 05:17:35 INFO node.AbstractConfigurationProvider: Channel memory-channel connected to [Weather, hdfs-write]
15/03/23 05:17:35 INFO node.Application: Starting new configuration:{
sourceRunners:{Weather=EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTP
Source{name:Weather,state:IDLE} }} sinkRunners:{hdfs-write=SinkRunner: {
policy:org.apache.flume.sink.DefaultSinkProcessor@529d1dd7 counterGroup:{
name:null counters:{} } }} channels:{memory-
channel=org.apache.flume.channel.MemoryChannel{name: memory-channel}} }
15/03/23 05:17:35 INFO node.Application: Starting Channel memory-channel
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Monitored
countergroup for type: CHANNEL, name: memory-channel: Successfully
registered new MBean.
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Component
type: CHANNEL, name: memory-channel started
15/03/23 05:17:35 INFO node.Application: Starting Sink hdfs-write
15/03/23 05:17:35 INFO node.Application: Starting Source Weather
15/03/23 05:17:35 INFO mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
15/3/23 05:17:35 INFO mortbay.log: jetty-6.1.26
15/03/23 05:17:36 WARN mortbay.log: failed
SelectChannelConnector@api.openweathermap.org/data/2.5/forecast/city?
id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a:11111:
java.net.SocketException: Unresolved address
15/03/23 05:17:36 WARN mortbay.log: failed Server@642c189d:
java.net.SocketException: Unresolved address
15/03/23 05:17:36 ERROR http.HTTPSource: Error while starting HTTPSource.
Exception follows.java.net.SocketException: Unresolved address
at sun.nio.ch.Net.translateToSocketException(Net.java:157)
at sun.nio.ch.Net.translateException(Net.java:183)
at sun.nio.ch.Net.translateException(Net.java:189)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
at org.mortbay.jetty.nio.SelectChannelConnector.open
(SelectChannelConnector.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
nector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java_
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
ceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run
(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access1(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
... 15 more
15/03/23 05:17:36 ERROR lifecycle.LifecycleSupervisor: Unable to start
EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} }
- Exception follows.
java.lang.RuntimeException: java.net.SocketException: Unresolved address
at com.google.common.base.Throwables.propagate(Throwables.java:156)
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:224)
at org.apache.flume.source.EventDrivenSourceRunner.start
(EventDrivenSourceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
fecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access1(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Unresolved address
at sun.nio.ch.Net.translateToSocketException(Net.java:157)
at sun.nio.ch.Net.translateException(Net.java:183)
at sun.nio.ch.Net.translateException(Net.java:189)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnec
tor.java:216)
at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
nector.java:315)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
at org.mortbay.jetty.Server.doStart(Server.java:235)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
... 9 more
Caused by: java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
... 15 more
15/03/23 05:17:39 ERROR lifecycle.LifecycleSupervisor: Unable to start
EventDrivenSourceRunner: {
source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE}
} - Exception follows.
java.lang.IllegalStateException: Running HTTP Server found in source:
Weather before I started one.Will not attempt to start.
at com.google.common.base.Preconditions.checkState(Preconditions.java:14
at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
ceRunner.java:44)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
fecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
access1(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:745)
^C15/03/23 05:17:41 INFO lifecycle.LifecycleSupervisor: Stopping
lifecycle supervisor 10
15/03/23 05:17:41 INFO node.PollingPropertiesFileConfigurationProvider:
Configuration provider stopping
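Please help me solve this problem.
Do I have to do something else before configuring the Flume agent?
Or should I use Nutch or Storm to crawl the data instead?
Please advise on what the best option is.
Thanks in advance.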
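The bind parameter of the HTTPSource specifies the IP address or hostname on which your agent listens for data. It is not the endpoint to crawl; it is the endpoint (together with the port) to which the crawler must send the data. A URL with a path and a query string is not a host the server socket can bind to, which is why startup fails with "Unresolved address".
That said, I would suggest using an Exec source to run a script that crawls openweather.org and writes the data to its output; that output is then used as the input data for the agent.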
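As a rough illustration of the Exec source suggestion, the script could look something like the sketch below. It is only a sketch under assumptions: the endpoint path is copied from the bind value in the question, while the http:// scheme, the function names, the polling interval, and the script name fetch_weather.py are all made up for the example. Each response is collapsed to a single line because the Exec source turns every line of the command's output into one event.

```python
import json
import sys
import time
import urllib.parse
import urllib.request

# Endpoint path taken from the `bind` value in the question;
# the http:// scheme is an assumption.
BASE_URL = "http://api.openweathermap.org/data/2.5/forecast/city"


def build_url(city_id, app_id):
    """Return the request URL with the city id and API key as query parameters."""
    query = urllib.parse.urlencode({"id": city_id, "APPID": app_id})
    return f"{BASE_URL}?{query}"


def to_event_line(raw_body):
    """Collapse a JSON document into one line so it becomes a single event."""
    return json.dumps(json.loads(raw_body), separators=(",", ":"))


def poll(city_id, app_id, interval_seconds=600):
    """Fetch the forecast repeatedly and print each response on stdout."""
    while True:
        try:
            url = build_url(city_id, app_id)
            with urllib.request.urlopen(url, timeout=30) as resp:
                print(to_event_line(resp.read().decode("utf-8")), flush=True)
        except (OSError, ValueError) as exc:
            # Failures go to stderr so they are not emitted on stdout as events.
            print(f"fetch failed: {exc}", file=sys.stderr)
        time.sleep(interval_seconds)


# Hypothetical usage: python fetch_weather.py YOUR_APPID [city_id]
if __name__ == "__main__" and len(sys.argv) > 1:
    poll(sys.argv[2] if len(sys.argv) > 2 else "285787", sys.argv[1])
```

On the agent side, the source would then be declared with weather.sources.Weather.type = exec and weather.sources.Weather.command = python /path/to/fetch_weather.py YOUR_APPID (the path is a placeholder), and the bind, port, username and password lines would no longer be needed. Note also that the sink in the question is declared with type logger, so it would probably have to be changed to type hdfs (and the roll interval written as hdfs.rollInterval) before anything is written to the hdfs.path.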