How to set 'charset' for DatumWriter || write avro that contains arabic characters to HDFS

Question

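Some of the data contains values in Arabic. After the data is written, both the reader code and the hadoop fs -text command show ?? instead of the Arabic characters.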

1) Writer

// avro object is provided as SpecificRecordBase
Path path = new Path(pathStr);
DatumWriter<SpecificRecord> datumWriter = new SpecificDatumWriter<>();
FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf); // HDFS File System

FSDataOutputStream outputStream = fs.create(path);
DataFileWriter<SpecificRecord> dataFileWriter = new DataFileWriter<>(datumWriter);

Schema schema = getSchema(); // method to get schema
dataFileWriter.setCodec(CodecFactory.snappyCodec());
dataFileWriter.create(schema, outputStream);
dataFileWriter.append(avroObject);
dataFileWriter.close(); // flushes the last block and closes the underlying stream

2) Reader

Configuration conf = new Configuration();
FsInput in = new FsInput(new Path(hdfsFilePathStr), conf);
DatumReader<Row> datumReader = new GenericDatumReader<>();
DataFileReader<Row> dataFileReader = new DataFileReader<>(in, datumReader);
GenericRecord outputData = (GenericRecord) dataFileReader.iterator().next(); // iterator() is a method, not a field

I also tried the hadoop fs -text {filePath} command; there, too, the Arabic values are displayed as ??.

Changing the format in which the data is written would be really difficult, because the same file has many consumers.

I have also tried reading the data through SpecificRecordBase, but I still get ??.

Edit

I also tried these (in both the reader and the writer):

Configuration conf = new Configuration();
conf.set("file.encoding", StandardCharsets.UTF_16.displayName());

and

System.setProperty("file.encoding", StandardCharsets.UTF_16.displayName());

Neither helped.

Answer 1

Apparently, HDFS does not support many non-English characters. To work around this, change the field in your Avro schema from string to bytes.

To convert your value from a String to bytes, use:

ByteBuffer.wrap(str.getBytes(StandardCharsets.UTF_8))

Then, when reading, convert it back to a String with:

new String(byteData.array(), StandardCharsets.UTF_8)

The rest of the code in your reader and writer stays the same.
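
The conversion in both directions can be sketched with the standard library alone. The class and method names here (Utf8Bytes, toBytes, fromBytes) are hypothetical and not part of Avro. As an assumption about how the buffer arrives at the reader, this version decodes from the buffer's position to its limit instead of calling array(), which would throw on a read-only buffer and could pick up extra bytes if the backing array is larger than the value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Avro: converts between a String and the
// ByteBuffer carried by an Avro "bytes" field.
public final class Utf8Bytes {
    private Utf8Bytes() {}

    // Encode the text as UTF-8 and wrap it for a "bytes" field.
    public static ByteBuffer toBytes(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decode only the bytes between position and limit. duplicate() leaves the
    // caller's buffer position untouched and also works on read-only buffers.
    public static String fromBytes(ByteBuffer data) {
        return StandardCharsets.UTF_8.decode(data.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Arabic sample text written with Unicode escapes so that the source
        // file encoding does not matter.
        String original = "\u0645\u0631\u062D\u0628\u0627";
        ByteBuffer encoded = toBytes(original);
        System.out.println(encoded.remaining());                 // UTF-8 byte count
        System.out.println(original.equals(fromBytes(encoded))); // round-trip check
    }
}
```

On the writer side the result of toBytes would be set on the bytes field; on the reader side fromBytes would be applied to the ByteBuffer read from that field.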

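With this change, the hadoop fs -text command will display English characters correctly, but it may still show garbled text for non-English characters. Your reader, however, will still be able to create the UTF-8 string from the ByteBuffer.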
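
How a display layer can turn readable text into ? can be reproduced without HDFS. The sketch below (class and method names are hypothetical) writes the same string through two writers: one that encodes as US-ASCII, standing in for a console or JVM whose charset cannot represent Arabic, and one that encodes as UTF-8. It only illustrates where ? can come from; it does not establish which component is responsible in the setup described in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what an output layer does with characters
// that its charset cannot represent.
public final class ConsoleEncodingDemo {
    private ConsoleEncodingDemo() {}

    // Write the text through a writer that uses the given charset and return
    // the raw bytes that would reach the terminal or file.
    public static byte[] printWith(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset));
        out.print(text);
        out.close(); // flushes the encoder into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u0645\u0631\u062D\u0628\u0627"; // Arabic sample text

        byte[] ascii = printWith(text, StandardCharsets.US_ASCII);
        byte[] utf8 = printWith(text, StandardCharsets.UTF_8);

        // Unmappable characters are replaced, one '?' per character.
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        // The UTF-8 bytes still decode back to the original text.
        System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```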

java

hdfs

avro