How do I access the full dataset in a Flume-to-Kafka pipeline?

I am reading the text file SMSSpamCollection with a Flume source and publishing it to a Kafka topic through a Flume sink.

     # Agent Name:
     a1.sources = r1
     a1.sinks = sample
     a1.channels = sample-channel


     # Source configuration:
     a1.sources.r1.type = exec
     a1.sources.r1.command = tail -f /Users/val/Documents/code/spark/m11_to_Upload/SMSSpamCollection
     a1.sources.r1.logStdErr = true

     # Sink type
     #a1.sinks.sample.type = logger

     # Buffers events in memory to channel
     a1.channels.sample-channel.type = memory
     a1.channels.sample-channel.capacity = 1000
     a1.channels.sample-channel.transactionCapacity = 100

     # Bind the source and sink to the channel
     a1.sources.r1.channels.selector.type = replicating
     a1.sources.r1.channels = sample-channel

     # Kafka sink settings: topic, broker list, and the channel the sink reads from
     a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
     a1.sinks.sample.topic = sample_topic
     a1.sinks.sample.brokerList = 127.0.0.1:9092
     a1.sinks.sample.requiredAcks = 1
     a1.sinks.sample.batchSize = 20
     a1.sinks.sample.channel = sample-channel

I start the agent with this command:

    flume-ng agent --conf conf --conf-file /usr/local/Cellar/flume/1.9.0/libexec/conf/flume-sample.conf  -Dflume.root.logger=DEBUG,console --name a1 -Xmx512m -Xms256m 

When I read the data back from the Kafka topic with

    kafka-console-consumer --topic sample_topic --from-beginning --bootstrap-server localhost:9092

I only see the last 10 records of the original file:

    ham Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...
    ham Ard 6 like dat lor.
    ham Why don't you wait 'til at least wednesday to see if you get your .
    ham Huh y lei...
    spam    REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
    spam    This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
    ham Will ü b going to esplanade fr home?
    ham Pity, * was in mood for that. So...any other suggestions?
    ham The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free
    ham Rofl. Its true to its name

What is the correct way to see all of the records?

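You are using tail, which by default prints only the last 10 lines of a file. With -f it then follows the file, so Flume receives those 10 lines plus anything appended later.

Use this instead: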

    a1.sources.r1.command = tail -c +0 -f /Users/val/Documents/code/spark/m11_to_Upload/SMSSpamCollection

-c +0 tells tail to start at the first byte of the file, so the entire file is emitted before tail starts following it.
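If you want to check the difference outside Flume first, here is a minimal sketch. It uses a throwaway 25-line file as a hypothetical stand-in for SMSSpamCollection, and it uses tail -n +1 (start at line 1), which is the line-based counterpart of -c +0.

```shell
# Build a 25-line sample file (hypothetical stand-in for SMSSpamCollection).
sample_file="$(mktemp)"
seq 1 25 > "$sample_file"

# Default tail: only the last 10 lines are printed.
default_count="$(tail "$sample_file" | wc -l | tr -d ' ')"

# tail -n +1: output starts at line 1, so the whole file is printed.
full_count="$(tail -n +1 "$sample_file" | wc -l | tr -d ' ')"

echo "default: $default_count lines, from start: $full_count lines"
rm -f "$sample_file"
```

On this sample it should report 10 lines for the default invocation and 25 when starting from the first line. Adding -f to either form only changes what happens after the initial output.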

As an aside, another option is to use Kafka Connect with a plugin such as Spooldir or File Pulse.
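For the Spooldir route, a rough sketch of a standalone connector config might look like the following. Treat it as a starting point under assumptions: the connector name, the three directories, and the file name spooldir-source.properties are hypothetical, and the property names are the ones I understand the kafka-connect-spooldir plugin to use, so check them against the documentation for the version you install.

```shell
# Write a hypothetical Spooldir connector config to the current directory.
config_file="spooldir-source.properties"
cat > "$config_file" <<'EOF'
# Hypothetical connector name.
name=sms-spooldir-source
tasks.max=1
# Line-delimited source connector from the kafka-connect-spooldir plugin.
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirLineDelimitedSourceConnector
# Same topic as in the Flume sink config.
topic=sample_topic
# Hypothetical directories: files are picked up from input.path and
# moved to finished.path (or error.path) after processing.
input.path=/tmp/sms/input
finished.path=/tmp/sms/finished
error.path=/tmp/sms/error
input.file.pattern=^SMSSpamCollection$
EOF

# With the plugin installed and a worker config available, the connector
# would then be started with something like:
#   connect-standalone worker.properties "$config_file"
echo "wrote $config_file"
```

Unlike the tail-based exec source, this approach reads each file in the input directory from the beginning, so the whole dataset is published without depending on tail options.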