Unable to infer schema for Parquet. It must be specified manually
I am running all of the code in EMR Notebooks.
spark.version
'3.0.1-amzn-0'
temp_df.printSchema()
root
|-- dt: string (nullable = true)
|-- AverageTemperature: double (nullable = true)
|-- AverageTemperatureUncertainty: double (nullable = true)
|-- State: string (nullable = true)
|-- Country: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- weekday: integer (nullable = true)
temp_df.show(2)
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
| dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01| 25.544| 1.171| Acre| Brazil|1855| 5| 1| 3|
|1855-06-01| 24.228| 1.103| Acre| Brazil|1855| 6| 1| 6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows
temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet',
mode='overwrite', partitionBy=['year'])
spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)
An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
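I have looked at other Stack Overflow posts, but the solution given there (the problem being caused by empty files being written) does not apply to my case.
Please help me. Thank you!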
Do not pass path as a keyword argument in the read.parquet call:
>>> spark.read.parquet(path='a.parquet')
21/01/02 22:53:38 WARN DataSource: All paths were ignored:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home//bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "/home//bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/home//bin/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>>> spark.read.parquet('a.parquet')
DataFrame[_2: string, _1: double]
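This is because read.parquet has no path parameter, so no path actually reaches the reader (hence the "All paths were ignored" warning). Pass the path positionally instead.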
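The explanation above can be illustrated without Spark. This is only a sketch: the parquet function below is a hypothetical stand-in, not PySpark code, and it assumes the reader method collects its arguments as (*paths, **options), which is consistent with the paths variable visible in the traceback. Under that assumption, a path= keyword ends up among the options and the tuple of paths stays empty.

```python
def parquet(*paths, **options):
    # Hypothetical stand-in, not PySpark code: it only mirrors the assumed
    # (*paths, **options) parameter shape and reports what it received.
    return list(paths), dict(options)


# Keyword call, as in the question: nothing lands in `paths`.
kw_paths, kw_options = parquet(path='a.parquet')

# Positional call, as in the answer: the string lands in `paths`.
pos_paths, pos_options = parquet('a.parquet')

print(kw_paths, kw_options)
print(pos_paths, pos_options)
```

With an empty list of paths there are no files to inspect, which would explain why schema inference fails even though the data was written correctly.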
Using load, which does accept path as a keyword argument, works:
>>> spark.read.load(path='a', format='parquet')
DataFrame[_1: string, _2: string]