How to read a Parquet file from S3 using Akka Streams or Alpakka

Question

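I am trying to read a Parquet file from S3 with Akka Streams, following the official doc, but I get this error: java.io.IOException: No FileSystem for scheme: s3a. Below is the code that triggers the exception. I would really appreciate a clue or an example showing how to do this correctly.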

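import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val path = "s3a://bucketName/path/to/foo/part-00000-656418ee-7cc0-42ee-93e-aaa69ee6f916.c000.snappy.parquet"
val conf: Configuration = new Configuration()
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
val file = HadoopInputFile.fromPath(new Path(path), conf) // the IOException is thrown here
val reader: ParquetReader[GenericRecord] =
    AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
    // should read the file's records here, but not there yet ...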

Answer 1

Your classpath is most likely missing the hadoop-aws library.

See here: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
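As a rough illustration, the missing library could be added to an sbt build like this. The version numbers are placeholders chosen for the sketch, not values from the question; the hadoop-aws version should match the Hadoop version already on your classpath.

```scala
// build.sbt (sketch; the version below is an assumption, adjust to your setup)
val hadoopVersion = "3.3.6"

libraryDependencies ++= Seq(
  // Core Hadoop client classes (Configuration, Path, FileSystem)
  "org.apache.hadoop" % "hadoop-client" % hadoopVersion,
  // Provides the S3A connector that handles s3a:// URIs
  "org.apache.hadoop" % "hadoop-aws" % hadoopVersion
)
```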

A related Stack Overflow question also gives more detail on how to set up the credentials needed to access S3.
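For reference, one way to pass static credentials to the S3A connector is through the Hadoop Configuration itself. The sketch below assumes hadoop-client is on the classpath; the object and method names (S3AConfig, withStaticCredentials) are hypothetical helpers invented for this example, while fs.s3a.access.key and fs.s3a.secret.key are property names from the hadoop-aws documentation.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper, not part of any library.
object S3AConfig {
  // Returns a Configuration carrying static S3A credentials.
  def withStaticCredentials(accessKey: String, secretKey: String): Configuration = {
    val conf = new Configuration()
    conf.set("fs.s3a.access.key", accessKey)
    conf.set("fs.s3a.secret.key", secretKey)
    conf
  }
}
```

In practice you would probably read the keys from the environment, for example S3AConfig.withStaticCredentials(sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY")), rather than hard-coding them.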

Once the AvroParquetReader is initialized correctly, you can create an Akka Streams Source from it, as described in the Alpakka Avro Parquet documentation (https://doc.akka.io/docs/alpakka/current/avroparquet.html#source-initiation).
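Putting the pieces together, a minimal sketch might look like the following. It assumes Akka 2.6 or later (so the materializer is derived from the implicit ActorSystem), the akka-stream-alpakka-avroparquet and hadoop-aws dependencies on the classpath, and S3 credentials that the S3A connector can find. The object name ReadParquetFromS3 and the actor system name are made up for the example; the path is the one from the question.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.alpakka.avroparquet.scaladsl.AvroParquetSource
import akka.stream.scaladsl.Source
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object ReadParquetFromS3 extends App {
  // Assumption: Akka 2.6+, where an implicit ActorSystem provides the materializer.
  implicit val system: ActorSystem = ActorSystem("parquet-reader")
  import system.dispatcher

  val conf = new Configuration()
  conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)

  val path =
    "s3a://bucketName/path/to/foo/part-00000-656418ee-7cc0-42ee-93e-aaa69ee6f916.c000.snappy.parquet"
  val file = HadoopInputFile.fromPath(new Path(path), conf)
  val reader: ParquetReader[GenericRecord] =
    AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()

  // Wrap the reader in an Akka Streams Source that emits one GenericRecord per row.
  val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)

  source
    .runForeach(record => println(record)) // print each record as it is read
    .onComplete(_ => system.terminate())   // shut down once the stream finishes or fails
}
```

Printing every record is only for demonstration; in a real pipeline the Source would normally be connected to further flows and a sink.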

Tags: scala, amazon-s3, parquet, akka-stream, alpakka