AWS EMR Spark Glue PySpark
I am trying to read an AWS Glue table into PySpark, and I am getting a NullPointerException:
spark.sql("show tables").show()
+----------------+-----------------+-----------+
| database| tableName|isTemporary|
+----------------+-----------------+-----------+
|test_datalake_db|events2_2017_test| false|
|test_datalake_db| events2_old| false|
+----------------+-----------------+-----------+
Next, I try to select something from the table:
df = spark.sql("select * from events2_2017_test")
However, this is where it all goes wrong:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 603, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.NullPointerException: Name is null
at java.lang.Enum.valueOf(Enum.java:236)
This fails as well:
myDf = spark.table("test_datalake_db.events2_2017_test")
Here is the table schema: (the schema screenshot is not reproduced here)
The answer is that you need to define the TableType as EXTERNAL_TABLE when you create the table. With TableType left unset, the Hive client apparently ends up passing a null name to Enum.valueOf, which matches the NullPointerException above:
import boto3

# Glue Data Catalog client; region and credentials come from the environment
client = boto3.client('glue')

def create_table():
    response = client.create_table(
        DatabaseName='xxxx-datalake',
        TableInput={
            'Name': 'events2_2017_sample',
            'Description': 'testing the events2 old data in s3',
            'Owner': 'drew',
            'Parameters': {
                'classification': 'csv'
            },
            # The crucial part: without an explicit TableType, Glue stores
            # a null type and Spark/Hive fails with the NPE above.
            'TableType': 'EXTERNAL_TABLE',
            'StorageDescriptor': {
                'Columns': [
                    {'Name': 'sys_vortex_id', 'Type': 'string'},
                    {'Name': 'sys_app_id', 'Type': 'string'},
                    {'Name': 'sys_pq_id', 'Type': 'string'},
                    {'Name': 'sys_ip_address', 'Type': 'string'},
                    {'Name': 'sys_submitted_at', 'Type': 'string'},
                    {'Name': 'sys_received_at', 'Type': 'string'},
                    {'Name': 'device_id_type', 'Type': 'string'},
                    {'Name': 'device_id', 'Type': 'string'},
                    {'Name': 'timezone', 'Type': 'string'},
                    {'Name': 'online', 'Type': 'string'},
                    {'Name': 'app_version', 'Type': 'string'},
                    {'Name': 'device_days', 'Type': 'string'},
                    {'Name': 'device_sessions', 'Type': 'string'},
                    {'Name': 'event_id', 'Type': 'string'}
                ],
                'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
                'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
                'SerdeInfo': {
                    'Name': 'noidea',
                    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde',
                    'Parameters': {
                        'separatorChar': '|'
                    }
                },
                'Location': 's3://xxxx-datalake-us-east-1/prd/101/27a89ba8-774a-4bc3-68ae/laboratory/events2_2017_test/',
                'Compressed': True,
                'StoredAsSubDirectories': False,
            }
        }
    )
    return response
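
If the table already exists, recreating it should not be strictly necessary: the type can also be patched in place. Below is a minimal, untested sketch using the Glue get_table/update_table APIs; the set_external helper and the ALLOWED_KEYS set are my own additions, not part of the original answer. get_table returns read-only fields (CreateTime, CreatedBy, and so on) that TableInput does not accept, so they have to be stripped before writing the definition back:

import boto3

glue = boto3.client('glue')

# Fields that get_table returns and that TableInput will accept back.
ALLOWED_KEYS = {
    'Name', 'Description', 'Owner', 'LastAccessTime', 'LastAnalyzedTime',
    'Retention', 'StorageDescriptor', 'PartitionKeys', 'ViewOriginalText',
    'ViewExpandedText', 'TableType', 'Parameters',
}

def set_external(database, table):
    # Fetch the current definition, keep only TableInput-compatible keys,
    # set the missing TableType, and write the definition back.
    current = glue.get_table(DatabaseName=database, Name=table)['Table']
    table_input = {k: v for k, v in current.items() if k in ALLOWED_KEYS}
    table_input['TableType'] = 'EXTERNAL_TABLE'
    glue.update_table(DatabaseName=database, TableInput=table_input)

set_external('test_datalake_db', 'events2_2017_test')

Once the type is set, the original spark.sql("select * from test_datalake_db.events2_2017_test") should resolve the table normally.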