Escape sequences not populating HDFS path and file prefix
In my Flume flow I want a custom, dynamic HDFS path, but the interceptors are not populating any data.
Sample data:
188 17 2016-06-01 00:31:10 6200.041736 0
Configuration:
agent2.sources.source2.interceptors = i2 i3 i4
agent2.sources.source2.interceptors.i2.type = regex_extractor
agent2.sources.source2.interceptors.i3.type = regex_extractor
agent2.sources.source2.interceptors.i4.type = regex_extractor
# regex to pick up the year
agent2.sources.source2.interceptors.i2.regex = (?<=\t)[0-9]{4}(?=-)
agent2.sources.source2.interceptors.i2.serializers = y
agent2.sources.source2.interceptors.i2.serializers.y.name = year
# regex to pick up the month
agent2.sources.source2.interceptors.i3.regex = (?<=-)[0-9]{2}(?=-)
agent2.sources.source2.interceptors.i3.serializers = m
agent2.sources.source2.interceptors.i3.serializers.m.name = month
# regex to pick up the day
agent2.sources.source2.interceptors.i4.regex = (?<=-)[0-9]{2}(?=\t)
agent2.sources.source2.interceptors.i4.serializers = d
agent2.sources.source2.interceptors.i4.serializers.d.name = day
# Define the HDFS sink 2 –year and month
agent2.sinks.sink-hdfs2.type = hdfs
agent2.sinks.sink-hdfs2.hdfs.path = /group-project/consumption/%{year}/%{month}
agent2.sinks.sink-hdfs2.hdfs.filePrefix = %{year}-%{month}
agent2.sinks.sink-hdfs2.hdfs.fileSuffix = .txt
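The lookahead and lookbehind in the year and day regexes match only a tab character. They will not match if the fields are separated by spaces. You are better off using \s.
Also, Flume regex notation requires two backslashes in the config file: \\t rather than \t.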
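As a rough sketch of the two fixes above, the year interceptor might look like the following. The parentheses around [0-9]{4} are my own addition, not part of the original answer: as far as I can tell, regex_extractor maps capture groups to serializers, so a pattern without a group may leave the header unset. The month and day interceptors would be changed the same way.

```properties
# Sketch only: year interceptor using \s instead of \t, with doubled backslashes.
# Assumption: the capture group ( ... ) is what gets handed to serializer "y".
agent2.sources.source2.interceptors.i2.type = regex_extractor
agent2.sources.source2.interceptors.i2.regex = (?<=\\s)([0-9]{4})(?=-)
agent2.sources.source2.interceptors.i2.serializers = y
agent2.sources.source2.interceptors.i2.serializers.y.name = year
```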
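Alternatively, you can use a single regex to capture the whole date and use multiple capture groups to assign the parts to different serializers. For example, (\d{4})-(\d{2})-(\d{2}), written with doubled backslashes in the config file.
The Flume User Guide has a good example: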
If the Flume event body contained 1:2:3.4foobar5
and the following configuration was used
a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3
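Applied to the configuration in the question, the single-regex approach could be sketched roughly as below. The interceptor name i1 is hypothetical, the source and serializer names are taken from the question, and the sink settings are assumed to stay as they are. This has not been run against a live agent.

```properties
# Sketch only: one regex_extractor replaces the three interceptors i2, i3, i4.
# "i1" is a placeholder name chosen for this example.
agent2.sources.source2.interceptors = i1
agent2.sources.source2.interceptors.i1.type = regex_extractor
# Three capture groups: year, month, day (backslashes doubled for the properties file).
agent2.sources.source2.interceptors.i1.regex = (\\d{4})-(\\d{2})-(\\d{2})
# Serializers are matched to capture groups in order.
agent2.sources.source2.interceptors.i1.serializers = y m d
agent2.sources.source2.interceptors.i1.serializers.y.name = year
agent2.sources.source2.interceptors.i1.serializers.m.name = month
agent2.sources.source2.interceptors.i1.serializers.d.name = day
```

For the sample line containing 2016-06-01, this should add the headers year=>2016, month=>06, day=>01, which the %{year} and %{month} escapes in hdfs.path and hdfs.filePrefix can then pick up.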