Is it better for Spark to select from Hive or select from file

Question

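I am just curious what people think about reading from Hive versus reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying files themselves, and why?

Mike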

Answer 1

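As far as I understand, Spark is optimized for Parquet, even though ORC is generally better suited to flat structures and Parquet to nested ones. So Parquet is the recommended format to use with Spark.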

In addition, the metadata for every table you read from Parquet will be stored in Hive anyway. From the Spark documentation: "Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
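The manual refresh mentioned in that quote could look like the sketch below. It assumes a Spark 1.x `HiveContext`, the same generation of API as the `sqlContext` used elsewhere in this thread. The object name, the method name `refreshAndCount`, and the table name passed in are hypothetical.

```scala
import org.apache.spark.sql.hive.HiveContext

object MetadataRefresh {
  // Invalidates Spark's cached metadata for `tableName`, then counts its rows.
  // `tableName` is a placeholder such as "db.table"; it is not from the question.
  def refreshAndCount(sqlContext: HiveContext, tableName: String): Long = {
    // Needed after Hive or another external tool has rewritten the table's files.
    sqlContext.refreshTable(tableName)
    sqlContext.table(tableName).count()
  }
}
```

On Spark 2.x and later, the equivalent call is `spark.catalog.refreshTable("db.table")` on a `SparkSession`.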

I tend to convert data to Parquet as early as possible and store it in Alluxio backed by HDFS. This gives me better read/write performance and lets me limit the use of cache.

Hope that helps.
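A minimal sketch of the conversion step, assuming the data has already been loaded into a DataFrame. The object name, method name, and `outputPath` are hypothetical. The path could be an `hdfs://` URI, or an `alluxio://` URI if the Alluxio client is configured on the cluster, which is an assumption and not something stated here.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object ParquetConversion {
  // Writes `df` as Parquet files under `outputPath`, replacing any existing output.
  // `outputPath` is a placeholder URI chosen by the caller.
  def saveAsParquet(df: DataFrame, outputPath: String): Unit =
    df.write.mode(SaveMode.Overwrite).parquet(outputPath)
}
```

The resulting directory can then be read back with `sqlContext.read.parquet(outputPath)`.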

Answer 2

tl;dr: I would read it directly from the Parquet files.

I am using Spark 1.5.2 and Hive 1.2.1. For a table of 5 million rows x 100 columns, some timings I recorded are:

val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")

dffile count --> 0.38s; dfhive count --> 8.99s

dffile sum(col) --> 0.98s; dfhive sum(col) --> 8.10s

dffile substring(col) --> 2.63s; dfhive substring(col) --> 7.77s

dffile where(col=value) --> 82.59s; dfhive where(col=value) --> 157.64s
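The answer does not say how these timings were taken. One simple way to collect comparable wall-clock numbers is a small helper like the sketch below. `Timing` and `timed` are hypothetical names, and a single run measured this way includes any startup or caching effects.

```scala
object Timing {
  // Runs `block` once and returns its result together with the elapsed
  // wall-clock time in seconds.
  def timed[T](block: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = block
    val seconds = (System.nanoTime() - start) / 1e9
    (result, seconds)
  }
}
```

For example, `val (n, secs) = Timing.timed(dffile.count())` would time the count on the file-based DataFrame, and the same call with `dfhive` would time the Hive-based one.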

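Note that these were measured with an older version of Hive and an older version of Spark, so I cannot comment on how the speed of the two read mechanisms may have improved since.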


hive

flat-file

apache-spark

parquet

spark-dataframe