HDFS File Encoding Converter

Question

I am trying to convert an HDFS file from UTF-8 to ISO-8859-1.

I wrote a small Java program:

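import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

String theInputFileName = "my-utf8-input-file.csv";
String theOutputFileName = "my-iso8859-output-file.csv";
Charset inputCharset = StandardCharsets.UTF_8;
Charset outputCharset = StandardCharsets.ISO_8859_1;

try (
    final FSDataInputStream in = theFileSystem.open(new Path(theInputFileName));
    final FSDataOutputStream out = theFileSystem.create(new Path(theOutputFileName))
)
{
    // Decode the input bytes explicitly as UTF-8.
    try (final BufferedReader reader = new BufferedReader(new InputStreamReader(in, inputCharset)))
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            // Re-encode each line and the separator as ISO-8859-1.
            // lineSeparator is a field of the enclosing class.
            out.write(line.getBytes(outputCharset));
            out.write(this.lineSeparator.getBytes(outputCharset));
        }
    }
} catch (IllegalArgumentException | IOException e)
{
    RddFileWriter.LOGGER.error(e, "Exception on file '%s'", theOutputFileName);
}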

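This code is executed on a Hadoop cluster using Spark (the output data is usually provided by an RDD).

To simplify my question, I removed the RDD/Datasets part so that the code works directly on HDFS files.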

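When I execute the code:

Locally on my DEV machine: it works! The local output file is encoded in ISO-8859-1.
On the EDGE server: it works, using the spark-submit command against HDFS files! The HDFS output file is encoded in ISO-8859-1.
On a Datanode via Oozie: it does not work :-( The HDFS output file is encoded in UTF-8 instead of ISO-8859-1.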
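One way to tell which encoding an output file really has is to look at the raw bytes of an accented character. The sketch below is only an illustration and not part of the program above; the class name EncodingBytesDemo and the helper toHex are made up for this example. It shows that « é » is two bytes in UTF-8 and a single byte in ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only to illustrate the byte-level difference.
final class EncodingBytesDemo {

    /** Formats bytes as space-separated uppercase hex, for example "C3 A9". */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "\u00e9"; // the character é
        System.out.println(toHex(sample.getBytes(StandardCharsets.UTF_8)));      // prints "C3 A9"
        System.out.println(toHex(sample.getBytes(StandardCharsets.ISO_8859_1))); // prints "E9"
    }
}
```

A hex dump of the HDFS output file containing C3 A9 where an « é » is expected would therefore point to UTF-8 bytes rather than ISO-8859-1.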

I don't understand which properties (or anything else) could cause this change in behavior.

Versions:

Hadoop: v2.7.3
Spark: v2.2.0
Java: 1.8

I look forward to your help. Thanks in advance.

Answer 1

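I finally found the source of the problem.

The input file on the cluster was corrupted: its encoding was not consistent throughout the file.

The external data is aggregated daily, and its encoding was recently changed from ISO to UTF-8 without notice...

Put more simply: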

the beginning contained badly converted characters: « Ã© Ãª Ã¨ » instead of « é ê è »
the end was correctly encoded

We split the data, fixed the encoding, and merged it back together to repair the input.
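The exact repair procedure is not shown here, so the following is only a sketch of one possible way to fix such lines, not the code that was actually used. It assumes the damaged part consists of UTF-8 bytes that were wrongly decoded as ISO-8859-1 (which produces « Ã© » for « é »); the class name MojibakeRepair and the method repairIfDoubleEncoded are hypothetical. A line is repaired only if it maps back to bytes that form strictly valid UTF-8; otherwise it is returned unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, assuming the damage is "UTF-8 bytes read as ISO-8859-1".
final class MojibakeRepair {

    /**
     * Returns the repaired line if it looks like UTF-8 text that was wrongly
     * decoded as ISO-8859-1; otherwise returns the line unchanged.
     */
    static String repairIfDoubleEncoded(String line) {
        try {
            // Step 1: map the characters back to the bytes they came from.
            // This fails if the line contains characters outside ISO-8859-1.
            ByteBuffer bytes = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(line));
            // Step 2: decode those bytes strictly as UTF-8.
            // This fails if the bytes are not valid UTF-8, for example a lone 0xE9.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            // Not a double-encoded line under this assumption: keep it as it is.
            return line;
        }
    }

    public static void main(String[] args) {
        // "Ã© Ãª Ã¨" written with escapes so the source file encoding does not matter.
        String damaged = "\u00c3\u00a9 \u00c3\u00aa \u00c3\u00a8";
        // Expected text: é ê è (how it renders depends on the console encoding).
        System.out.println(repairIfDoubleEncoded(damaged));
    }
}
```

This is a heuristic: a line that mixes damaged and correct accented characters is left untouched, and text that was mis-decoded as Windows-1252 rather than ISO-8859-1 would need a different charset in step 1.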

The final code works fine.

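import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

// Reads the input file with theInputCharset and rewrites it with theOutputCharset.
private void changeEncoding(
            final Path thePathInputFileName, final Path thePathOutputFileName,
            final Charset theInputCharset, final Charset theOutputCharset,
            final String theLineSeparator
        ) {
    try (
        final FSDataInputStream in = this.fileSystem.open(thePathInputFileName);
        final FSDataOutputStream out = this.fileSystem.create(thePathOutputFileName);
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, theInputCharset));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, theOutputCharset))) {

        String line;
        while ((line = reader.readLine()) != null) {
            // The writer encodes each line and separator with theOutputCharset.
            writer.write(line);
            writer.write(theLineSeparator);
        }

    } catch (IllegalArgumentException | IOException e) {
        LOGGER.error(e, "Exception on file '%s'", thePathOutputFileName);
    }
}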
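To check the Reader/Writer approach without a cluster, the same loop can be run against in-memory streams. The following is a minimal sketch under that assumption; the class StreamTranscoder and its transcode method are hypothetical names, and plain java.io streams stand in for the HDFS streams.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for changeEncoding that works on any pair of streams.
final class StreamTranscoder {

    /** Copies text line by line, decoding with inCharset and encoding with outCharset. */
    static void transcode(InputStream in, OutputStream out,
                          Charset inCharset, Charset outCharset,
                          String lineSeparator) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, inCharset));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, outCharset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write(lineSeparator);
            }
        } // closing the writer flushes the encoded bytes to 'out'
    }

    public static void main(String[] args) throws IOException {
        // "café" plus a newline: 5 characters, 6 bytes in UTF-8.
        byte[] utf8 = "caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(utf8), out,
                  StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, "\n");
        System.out.println(out.size()); // prints 5: one byte per character in ISO-8859-1
    }
}
```

Note that every output line ends with the given separator, so a final line without a trailing newline in the input gains one in the output.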

Investigation closed! ;-)

java

encoding

hadoop

hdfs

apache-spark