How to read an Avro file in PySpark and extract the values?
How do I read a twitter.avro file in PySpark and extract the values from it?
rdd=sc.textFile("twitter.asvc")
works fine.
But when I do this:
rdd1=sc.textFile("twitter.avro")
rdd1.collect()
I get the output below:
['Obj\x01\x02\x16avro.schema\x04{"type":"record","name":"episodes","namespace":"testing.hive.avro.serde","fields":[{"name":"title","type":"string","doc":"episode title"},{"name":"air_date","type":"string","doc":"initial date"},{"name":"doctor","type":"int","doc":"main actor playing the Doctor in episode"}]}\x00kR\x03LS\x17m|]Z^{0\x10\x04"The Eleventh Hour\x183 April 2010\x16"The Doctor\'s Wife\x1614 May 2011\x16&Horror of Fang Rock 3 September 1977\x08$An Unearthly Child 23 November 1963\x02*The Mysterious Planet 6 September 1986\x0c\x08Rose\x1a26 March 2005\x12.The Power of the Daleks\x1e5 November 1966\x04\x14Castrolava\x1c4 January 1982', 'kR\x03LS\x17m|]Z^{0']
Is there a Python library that can read this format?
You should use a FileInputFormat that is specific to Avro files.
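As background on why sc.textFile returns those bytes: an Avro file is a binary container, not line-oriented text. The `Obj\x01` marker and the `avro.schema` key visible in the output are its header. The sketch below is only an illustration using the Python standard library, assuming the header layout described in the Avro specification (magic bytes, a metadata map, a 16-byte sync marker); `read_long`, `read_bytes`, and `read_header` are hypothetical helper names, not part of any library.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long: a base-128 varint holding a zigzag-encoded value."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data inside a varint")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear means this was the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro strings use the same layout)."""
    length = read_long(stream)
    return stream.read(length)


def read_header(stream):
    """Return (schema, metadata, sync_marker) from the start of a container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)  # number of key/value pairs in this map block
        if count == 0:             # a zero count terminates the map
            break
        if count < 0:              # negative count: a byte size follows, then |count| pairs
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    return schema, meta, sync
```

Opening twitter.avro in binary mode (`open("twitter.avro", "rb")`) and passing the file object to `read_header` should, under these assumptions, yield the `episodes` schema seen in the output. The records after the header are binary-encoded as well, which is why a text reader cannot split them into meaningful lines.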
Unfortunately, I don't use Python, so I can only link you to a solution. You can have a look at: https://github.com/apache/spark/blob/master/examples/src/main/python/avro_inputformat.py
The most interesting part is this:
# `path` is the Avro file to read; `conf` is an optional dict of Hadoop settings (may be None)
avro_rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf)
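To get from that RDD to the actual values, something like the following might work. This is a sketch, not a tested recipe: it assumes the key converter hands each Avro record to Python as a dict in the key position of a (key, value) pair, with None as the value, and that the jars providing the input format and the converter class are on the Spark classpath. `read_avro_records` and `extract_field` are hypothetical helper names introduced here for illustration.

```python
AVRO_INPUT_FORMAT = "org.apache.avro.mapreduce.AvroKeyInputFormat"
AVRO_KEY_CLASS = "org.apache.avro.mapred.AvroKey"
NULL_VALUE_CLASS = "org.apache.hadoop.io.NullWritable"
KEY_CONVERTER = (
    "org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter"
)


def read_avro_records(sc, path, conf=None):
    """Return an RDD of records (assumed to arrive as dicts) from an Avro file."""
    pairs = sc.newAPIHadoopFile(
        path,
        AVRO_INPUT_FORMAT,
        AVRO_KEY_CLASS,
        NULL_VALUE_CLASS,
        keyConverter=KEY_CONVERTER,
        conf=conf,
    )
    # Each element is a (record, None) pair, so keep only the record.
    return pairs.map(lambda pair: pair[0])


def extract_field(records, field):
    """Return an RDD holding one field's value from every record."""
    return records.map(lambda record: record[field])


# Usage, assuming `sc` is an existing SparkContext:
#   records = read_avro_records(sc, "twitter.avro")
#   titles = extract_field(records, "title").collect()
```

The field names (`title`, `air_date`, `doctor`) come from the schema embedded in the file, as shown in the question's output. On Spark 2.4 and later, the separate spark-avro module also lets you load the file as a DataFrame with `spark.read.format("avro").load(path)`, provided that package is added to the session.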