What should the flume.conf parameters be to save tweets to a single FlumeData file per hour?
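We are saving tweets in a directory layout like /user/flume/2016/06/28/13/FlumeData..., but it creates more than 100 FlumeData files every hour. I changed TwitterAgent.sinks.HDFS.hdfs.rollSize = 52428800 (50 MB) and the same thing happened again. After that I also tried changing the rollCount parameter, but that did not work either. How can I set the parameters to get one FlumeData file per hour?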
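What about rollInterval? Did you set it to zero? If so, the problem may lie elsewhere. If rollInterval is set to a nonzero value, it takes effect regardless of the rollSize and rollCount values, so a file can roll before it reaches the rollSize value. Also check the HDFS block size you have configured; if it is set too small, that alone may cause files to roll.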
Try this:
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hpc01:8020/user/flume/tweets/%Y/%m/%d/%H
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 3600
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hpc01:8020/user/flume/tweets/%Y/%m/%d/%H
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
I solved this by setting the flume.conf parameters rollInterval=3600, rollCount=0 and batchSize=100, as @vkgade suggested.
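For anyone wondering why the file count changes so much between these settings, here is a rough back-of-the-envelope sketch in Python. It is not part of Flume; the function name, the steady rate of 5 tweets per second, and the 2000-byte average event size are all made-up assumptions for illustration. It only models the idea that whichever enabled roll trigger fires first closes the file, and that a value of 0 disables a trigger. The first scenario also assumes rollInterval was left at its documented default of 30 seconds, which the question does not state.

```python
import math


def estimated_files_per_hour(roll_interval_s, roll_count, roll_size_bytes,
                             events_per_s, avg_event_bytes):
    """Rough estimate of files written per hour at a steady event rate.

    Hypothetical helper, not a Flume API. A value of 0 disables that
    trigger, mirroring the hdfs.roll* settings of the HDFS sink.
    """
    if events_per_s <= 0 or avg_event_bytes <= 0:
        raise ValueError("event rate and event size must be positive")

    # Seconds until each enabled trigger would close the current file.
    seconds_to_roll = []
    if roll_interval_s > 0:
        seconds_to_roll.append(roll_interval_s)
    if roll_count > 0:
        seconds_to_roll.append(roll_count / events_per_s)
    if roll_size_bytes > 0:
        seconds_to_roll.append(roll_size_bytes / (events_per_s * avg_event_bytes))

    if not seconds_to_roll:
        # No trigger enabled: assume one file per hourly %H directory.
        return 1

    # The earliest trigger wins, so it determines the roll frequency.
    return max(1, math.ceil(3600 / min(seconds_to_roll)))


# Assumed workload: 5 tweets/s, 2000 bytes each (illustrative numbers only).
print(estimated_files_per_hour(30, 0, 52428800, 5, 2000))  # 120
print(estimated_files_per_hour(3600, 0, 0, 5, 2000))       # 1
print(estimated_files_per_hour(0, 10, 0, 5, 2000))         # 1800
```

Under these assumptions, a 50 MB rollSize makes no difference while a 30-second interval is still active, which would be consistent with seeing more than 100 files per hour. Setting rollInterval=3600 with rollSize=0 and rollCount=0 leaves the interval as the only trigger.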