Reading LZO file of json lines in Spark DataFrames

Question

I have a large indexed LZO file in HDFS that I would like to read into a Spark DataFrame. The file contains JSON documents, one per line.

posts_dir='/data/2016/01'

posts_dir has the following contents:

/data/2016/01/posts.lzo
/data/2016/01/posts.lzo.index

The following works, but it does not use the index, so it takes a long time because it runs with only one mapper.

posts = spark.read.json(posts_dir)

Is there a way to make it use the index?

Answer 1

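I solved this by first creating an RDD that recognizes the index, and then using the from_json function to turn each line into a StructType, which effectively produces a result similar to spark.read.json(...):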

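from pyspark.sql import Row
from pyspark.sql import functions as F

# LzoTextInputFormat (from hadoop-lzo) uses the .index file to split the input.
posts_rdd = sc.newAPIHadoopFile(posts_dir,
                                'com.hadoop.mapreduce.LzoTextInputFormat',
                                'org.apache.hadoop.io.LongWritable',
                                'org.apache.hadoop.io.Text')

# Each record is (byte offset, line); keep the line and parse it with the schema.
posts_df = posts_rdd.map(lambda x: Row(x[1]))\
                    .toDF(['raw'])\
                    .select(F.from_json('raw', schema=posts_schema).alias('json'))\
                    .select('json.*')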

I don't know of a better or more direct way.

apache-spark

hadoop-lzo

spark-dataframe