Appending UUID in file name when streaming via StreamSets Data Collector

Question

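I am using the HTTP Client origin to stream a file from an HTTP URL to a Hadoop destination, but the file name at the destination has a random UUID appended to it. I want the file name to be the same as at the source.

Example: the source file name is README.txt and the destination file name is README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt

I want the destination file name to be README.txt

Here is my configuration.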

HTTP Client :

General

Name : HTTP Client 1

Description : 

On Record Error : Send to Error

HTTP

Resource URL : http://files.data.gouv.fr/sirene/README.txt

Headers : 

Mode : Streaming

Per-Status Actions

HTTP Status Code : 500 | Action for status : Retry with exponential backoff |

Base Backoff Interval (ms) : 1000 | Max Retries : 10

HTTP Method : GET

Body Time Zone : UTC (UTC)

Request Transfer Encoding : BUFFERED

HTTP Compression : None

Connect Timeout : 0

Read Timeout : 0

Authentication Type : None

Use OAuth 2

Use Proxy

Max Batch Size (records) : 1000

Batch Wait Time (ms) : 2000

Pagination

Pagination Mode : None

TLS

Use TLS

Timeout Handling

Action for timeout : Retry immediately

Max Retries : 10

Data Format

Data Format : Text

Compression Format : None

Max Line Length : 1024

Use Custom Delimiter

Charset : UTF-8

Ignore Control Characters

Logging 

Enable Request Logging

Hadoop FS Destination :

General

Name : Hadoop FS 1

Description : Writing into HDFS

Stage Library : CDH 5.10.1

Produce Events

Required Fields

Preconditions

On Record Error : Send to Error

Output Files

File Type : Text Files

Files Prefix : README

File Suffix : txt

Directory in Header

Directory Template : /user/username/

Data Time Zone : UTC (UTC)

Time Basis : ${time:now()}

Max Records in File : 0

Max File Size (MB) : 0

Idle Timeout : ${1 * HOURS}

Compression Codec : None

Use Roll Attribute

Validate HDFS Permissions : ON

Skip file recovery

Late Records

Late Record Time Limit (secs) : ${1 * HOURS}

Late Record Handling : Send to error

Data Format

Data Format : Text

Text Field Path : /text

Record Separator : \n

On Missing Field : Report Error

Charset : UTF-8

Answer 1

You can configure the file name prefix and suffix, but you cannot remove the UUID.

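In many cases, the directory is the smallest useful filesystem entity in Hadoop. Multiple clients may be writing files at the same time, and files may be 'rolled' (the current output file is closed and a new one is opened) for operational reasons, such as the file size exceeding a given threshold. Data Collector therefore ensures that file names are unique, to avoid accidental data loss.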
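To illustrate the collision this describes, here is a minimal Python sketch. It is not Data Collector's actual code: the function name `rolled_file_name` is hypothetical, and the `prefix_uuid.suffix` layout is assumed from the file name shown in the question.

```python
import uuid

def rolled_file_name(prefix: str, suffix: str) -> str:
    """Build a name like README_<uuid>.txt (layout assumed from the question)."""
    return f"{prefix}_{uuid.uuid4()}.{suffix}"

# Two writers, or two rolls of the same writer, share the prefix and suffix.
first = rolled_file_name("README", "txt")
second = rolled_file_name("README", "txt")

# With a fixed name both would target README.txt and one could overwrite
# the other; with a random UUID in the name the two paths differ.
print(first != second)
```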

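If you really want to do this, there is a workaround: enable events on the Hadoop destination and use the HDFS File Metadata executor to rename the file. For more information, see the StreamSets case study on output file management.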
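The rename step needs a mapping from the generated name back to the original one. A minimal Python sketch of that mapping follows. It only illustrates the string transformation and does not call any StreamSets API; the function name `strip_uuid` and the regular expression are assumptions based on the `README_<uuid>.txt` pattern shown in the question.

```python
import re

# Matches "_<8-4-4-4-12 hex UUID>" sitting directly before the file
# extension, or at the very end of a name that has no extension.
UUID_PART = re.compile(
    r"_[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=\.[^.]*$|$)"
)

def strip_uuid(file_name: str) -> str:
    """Return file_name with the UUID part removed, or unchanged if absent."""
    return UUID_PART.sub("", file_name, count=1)

print(strip_uuid("README_112e5d4b-4d85-4764-ab81-1d7b6e0237b2.txt"))  # README.txt
```

Keep in mind that renaming to a fixed name brings back the risk described above: a later roll or a second writer would target the same path.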
