Save REST API GET method response as a JSON document
I am using the code below to read from a REST API and write the response as a JSON document in PySpark, then save the file to Azure Data Lake Gen2. The code works fine when the response contains no blank data, but when I try to retrieve all the data, the run fails with the following error.
Error message: ValueError: Some of types cannot be determined after inferring.
Code:
import requests
response = requests.get('https://apiurl.com/demo/api/v3/data',
auth=('user', 'password'))
data = response.json()
from pyspark.sql import *
df=spark.createDataFrame([Row(**i) for i in data])
df.show()
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")
Response:
[
{
"ProductID": "156528",
"ProductType": "Home Improvement",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
},
{
"ProductID": "126789",
"ProductType": "Pharmacy",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
}
]
I am trying to fix it with an explicit schema, as below.
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductType", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("SaleDate", StringType(), True),
    StructField("UpdateDate", StringType(), True)
])
df = spark.createDataFrame([[None, None, None, None, None]], schema=schema)
df.show()
I am not sure how to create the dataframe and write the data to a JSON document.
The ValueError is raised during schema inference: when a field is null in every row Spark samples, it cannot determine that column's type. If you pass both the data and schema variables to spark.createDataFrame(), Spark skips inference and creates the dataframe even when fields are blank.
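For instance, a record set whose Description is always None reproduces the error (the data here is hypothetical, trimmed to two fields):
from pyspark.sql import Row

# "Description" is None in every row, so inference has nothing to work with
rows = [Row(ProductID="156528", Description=None),
        Row(ProductID="126789", Description=None)]

# Raises: ValueError: Some of types cannot be determined after inferring
df = spark.createDataFrame(rows)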
Example:
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.types import *
data=[
{
"ProductID": "156528",
"ProductType": "Home Improvement",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
},
{
"ProductID": "126789",
"ProductType": "Pharmacy",
"Description": "",
"SaleDate": "0001-01-01T00:00:00",
"UpdateDate": "2015-02-01T16:43:18.247"
}
]
schema = StructType([
    StructField("ProductID", StringType(), True),
    StructField("ProductType", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("SaleDate", StringType(), True),
    StructField("UpdateDate", StringType(), True)
])
df = spark.createDataFrame(data, schema=schema)
df.show()
#+---------+----------------+-----------+-------------------+--------------------+
#|ProductID| ProductType|Description| SaleDate| UpdateDate|
#+---------+----------------+-----------+-------------------+--------------------+
#| 156528|Home Improvement| |0001-01-01T00:00:00|2015-02-01T16:43:...|
#| 126789| Pharmacy| |0001-01-01T00:00:00|2015-02-01T16:43:...|
#+---------+----------------+-----------+-------------------+--------------------+
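Putting it together with the request from the question: once the explicit schema is in place, the original write call saves the dataframe as a JSON document in Azure Data Lake Gen2 (the URL, credentials, and storage path below are the question's placeholders):
import requests

# Fetch the API response and parse it as a list of dicts
response = requests.get('https://apiurl.com/demo/api/v3/data',
                        auth=('user', 'password'))
data = response.json()

# Explicit schema, so no type inference is attempted on blank fields
df = spark.createDataFrame(data, schema=schema)
df.write.mode("overwrite").json("wasbs://<file_system>@<storage-account-name>.blob.core.windows.net/demo/data")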