How does Hive determine the size of input data?

I am trying to understand Hive internals. What class/method does Hive use to determine the size of a dataset in S3?

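Hive is built on top of Hadoop and does its input/output through Hadoop's FileSystem API. More precisely, each table has an InputFormat and an OutputFormat, configurable when you create the table, and these get their data from a FileSystem object (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html). The FileSystem object abstracts most aspects of file management, so Hive does not have to care whether a file lives on S3 or HDFS; the Hadoop filesystem layer takes care of that. Every file is identified by a Path, which is a URL (e.g., hdfs:///dir/file or s3://bucket/path). The Path class resolves the filesystem through its getFileSystem method, which for an S3 URL would return an S3 implementation such as S3FileSystem. From the FileSystem object, Hive obtains a FileStatus for the file and reads the file size with its getLen method.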
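The lookup chain described above (Path, then FileSystem, then FileStatus, then getLen) can be sketched roughly as follows. This is only an illustration of the Hadoop calls involved, not Hive's actual code: the class name InputSizeSketch and the helpers fileSize and directorySize are made up for this example, and it assumes the Hadoop client libraries (plus the relevant S3 connector and credentials, if you point it at S3) are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class for illustration; not part of Hive or Hadoop.
public class InputSizeSketch {

    /** Returns the length in bytes of one file, whichever filesystem backs it. */
    static long fileSize(Path file, Configuration conf) throws IOException {
        // The URL scheme of the path (hdfs, s3, file, ...) selects the implementation.
        FileSystem fs = file.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(file);
        return status.getLen();
    }

    /** Sums the lengths of the plain files directly under a directory (not recursive). */
    static long directorySize(Path dir, Configuration conf) throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        long total = 0L;
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isFile()) {
                total += status.getLen();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: InputSizeSketch <directory-url>");
            return;
        }
        Path input = new Path(args[0]);
        Configuration conf = new Configuration();
        System.out.println(directorySize(input, conf) + " bytes under " + input);
    }
}
```

The same code works for a local file:// path, an hdfs:// path, or an S3 path, because only the FileSystem implementation returned by getFileSystem changes.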

If you want to see where this is done in the Hive source, it is typically in org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is the default setting for hive.input.format.

apache

hive

hiveql