
Ingesting a log file into HDFS using Flume while it is being written

What is the best way to ingest a log file into HDFS while it is still being written? I am trying to set up Apache Flume and to configure a source that gives me data reliability. I first tried to configure "exec" and later also looked at "spooldir", but the following passages in the documentation on flume.apache.org cast doubt on my approach:
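For reference, the exec-based setup I was experimenting with looked roughly like the following (the agent name, channel settings, and paths are just placeholders):

    # Exec source tailing the application log, feeding an HDFS sink (placeholder names/paths)
    agent1.sources  = tail-src
    agent1.channels = mem-ch
    agent1.sinks    = hdfs-sink

    agent1.sources.tail-src.type     = exec
    agent1.sources.tail-src.command  = tail -F /var/log/myapp/app.log
    agent1.sources.tail-src.channels = mem-ch

    agent1.channels.mem-ch.type     = memory
    agent1.channels.mem-ch.capacity = 10000

    agent1.sinks.hdfs-sink.type      = hdfs
    agent1.sinks.hdfs-sink.channel   = mem-ch
    agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/myapp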

Exec Source:

One of the most commonly requested features is the use case like- "tail -F file_name" where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there’s an obvious problem; what happens if the channel fills up and Flume can’t send an event? Flume has no way of indicating to the application writing the log file, that it needs to retain the log or that the event hasn’t been sent for some reason. Your application can never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource!

Spooling Directory Source:

Unlike the Exec source, "spooldir" source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable files must be dropped into the spooling directory. If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.

Is there a better approach I could use that ensures Flume does not miss any events while still reading in near real time?

I would recommend the Spooling Directory Source because of its reliability. A workaround for the immutability requirement is to assemble the files in a second staging directory and, once they reach a certain size (in bytes or number of log entries), move them into the spooling directory.
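As a rough illustration (the agent name, local directories, and HDFS path below are my own placeholders, not anything prescribed by the Flume documentation), a spooling-directory agent writing into HDFS could be configured along these lines:

    # Spooling-directory source -> file channel -> HDFS sink (placeholder paths)
    agent1.sources  = spool-src
    agent1.channels = file-ch
    agent1.sinks    = hdfs-sink

    agent1.sources.spool-src.type     = spooldir
    agent1.sources.spool-src.spoolDir = /var/log/myapp/flume-spool
    agent1.sources.spool-src.channels = file-ch

    # A file channel keeps events on disk, so they survive an agent restart
    agent1.channels.file-ch.type          = file
    agent1.channels.file-ch.checkpointDir = /var/flume/checkpoint
    agent1.channels.file-ch.dataDirs      = /var/flume/data

    agent1.sinks.hdfs-sink.type      = hdfs
    agent1.sinks.hdfs-sink.channel   = file-ch
    agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/myapp/%Y-%m-%d
    agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

The hand-off from the staging directory into spoolDir could be a small watcher along the lines of the Python sketch below. The paths and size threshold are again assumptions; it also assumes the application has already closed (rotated) the files it drops into the staging directory, and that both directories are on the same filesystem so the move is a simple rename.

    import os
    import shutil
    import time

    STAGING_DIR = "/var/log/myapp/staging"    # where the application drops rotated log files
    SPOOL_DIR = "/var/log/myapp/flume-spool"  # Flume's spoolDir (same filesystem as STAGING_DIR)
    SIZE_THRESHOLD = 64 * 1024 * 1024         # hand files off once they reach ~64 MB

    def hand_off_finished_files():
        for name in os.listdir(STAGING_DIR):
            src = os.path.join(STAGING_DIR, name)
            # Only move regular files that have reached the threshold; the
            # application must no longer be writing to them at this point.
            if os.path.isfile(src) and os.path.getsize(src) >= SIZE_THRESHOLD:
                # On the same filesystem this is a rename, so Flume never
                # sees a partially copied file in its spooling directory.
                shutil.move(src, os.path.join(SPOOL_DIR, name))

    if __name__ == "__main__":
        while True:
            hand_off_finished_files()
            time.sleep(30)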