Flume: loading CSV and Excel files from a spooling directory to an HDFS sink
I have configured my Flume source with type spooldir. I have many CSV files, as well as .xl3 and .xls files, and I want my Flume agent to load all of the files from the spooling directory into the HDFS sink. However, the Flume agent throws an exception.
Here is my configuration for the Flume source:
agent.sources.s1.type = spooldir
agent.sources.s1.spoolDir = /my-directory
agent.sources.s1.basenameHeader = true
agent.sources.batchSize = 10000
And my HDFS sink:
agent.sinks.sk1.type = hdfs
agent.sinks.sk1.hdfs.path = hdfs://...:8020/user/importflume/%Y/%m/%d/%H
agent.sinks.sk1.hdfs.filePrefix = %{basename}
agent.sinks.sk1.hdfs.rollSize = 0
agent.sinks.sk1.hdfs.rollCount = 0
agent.sinks.sk1.hdfs.useLocalTimeStamp = true
agent.sinks.sk1.hdfs.batchsize = 10000
agent.sinks.sk1.hdfs.fileType = DataStream
agent.sinks.sk1.serializer = avro_event
agent.sinks.sk1.serializer.compressionCodec = snappy
You can use the following configuration for the spooling directory; just fill in the local filesystem path and the HDFS location. Note that, as posted, your configuration never binds the source and sink to a channel, which is the most likely cause of the startup exception; also, agent.sources.batchSize is missing the source name (it should be agent.sources.s1.batchSize), and the sink property is hdfs.batchSize, with a capital S.
#Flume Configuration Starts
# Define a file channel called fileChannel1_1 on agent1
agent1.channels.fileChannel1_1.type = file
# on linux FS
agent1.channels.fileChannel1_1.capacity = 200000
agent1.channels.fileChannel1_1.transactionCapacity = 1000
# Define a source for agent1
agent1.sources.source1_1.type = spooldir
# on linux FS
#Spooldir in my case is /home/hadoop/Desktop/flume_sink
agent1.sources.source1_1.spoolDir = 'path'
agent1.sources.source1_1.fileHeader = false
agent1.sources.source1_1.fileSuffix = .COMPLETED
agent1.sinks.hdfs-sink1_1.type = hdfs
#Sink is /flume_import under hdfs
agent1.sinks.hdfs-sink1_1.hdfs.path = hdfs://'path'
agent1.sinks.hdfs-sink1_1.hdfs.batchSize = 1000
agent1.sinks.hdfs-sink1_1.hdfs.rollSize = 268435456
agent1.sinks.hdfs-sink1_1.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink1_1.hdfs.rollCount = 50000000
agent1.sinks.hdfs-sink1_1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1_1.hdfs.fileType = DataStream
agent1.sources.source1_1.channels = fileChannel1_1
agent1.sinks.hdfs-sink1_1.channel = fileChannel1_1
agent1.sinks = hdfs-sink1_1
agent1.sources = source1_1
agent1.channels = fileChannel1_1
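Once the configuration is saved to a file (agent1.conf here is just an illustrative name), the agent can be started with the standard flume-ng command:

flume-ng agent --conf conf --conf-file agent1.conf --name agent1 -Dflume.root.logger=INFO,console

The --name argument must match the agent name used in the property keys (agent1 above).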
You can also refer to this blog on the Flume spooling directory source for more information.
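Alternatively, if you prefer to keep your original agent layout, the minimal fix is to declare the components and bind a channel between the source and the sink. A sketch, assuming a memory channel named c1 (the channel name and the capacity values are illustrative, not from your post):

agent.sources = s1
agent.sinks = sk1
agent.channels = c1

# A memory channel connecting s1 to sk1 (a file channel works too)
agent.channels.c1.type = memory
agent.channels.c1.capacity = 100000
agent.channels.c1.transactionCapacity = 10000

# Bind the source and the sink to the channel
agent.sources.s1.channels = c1
agent.sinks.sk1.channel = c1

# Property-name fixes: batchSize belongs to the source s1,
# and the HDFS sink property is hdfs.batchSize (capital S)
agent.sources.s1.batchSize = 10000
agent.sinks.sk1.hdfs.batchSize = 10000

With the channel in place, the rest of your spooldir source and HDFS sink settings can stay as they are.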