PySpark writing data from Databricks into Azure SQL: ValueError: Some of types cannot be determined after inferring
I am using PySpark to write data from Azure Databricks to Azure SQL.
The code runs fine when there are no null values, but when the dataframe contains null values I get the following error:
databricks/spark/python/pyspark/sql/pandas/conversion.py:300: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
Unable to convert the field Product. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion.
Context: Unsupported type in conversion from Arrow: null
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
ValueError: Some of types cannot be determined after inferring
The dataframe has to be written to SQL including the null values. How can I solve this?
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

def to_sql(df, table):
    # Convert the pandas DataFrame to a Spark DataFrame and write it over JDBC
    finaldf = sqlContext.createDataFrame(df)
    finaldf.write.jdbc(url=url, table=table, mode="overwrite", properties=properties)

to_sql(data, f"TF_{table.upper()}")
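For context, the ValueError comes from type inference: a column that contains only nulls gives Spark nothing to infer a type from, which is also why the Arrow path reports "Unsupported type in conversion from Arrow: null". A minimal sketch that reproduces it, assuming the Databricks-provided spark session and a partly invented column layout:

import pandas as pd

# Hypothetical data: 'Product' holds only nulls, so neither the Arrow path nor
# the fallback inference can determine its type, and createDataFrame raises
# "ValueError: Some of types cannot be determined after inferring".
pdf = pd.DataFrame({"Id": [1, 2], "Product": [None, None]})
spark.createDataFrame(pdf)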
Edit:
Solved it by creating a function that maps pandas dtypes to SQL dtypes and outputs the columns and their dtypes as a single string.
def convert_dtype(df):
    # Map pandas dtypes to SQL Server column types
    df_mssql = {'int64': 'bigint', 'object': 'varchar(200)', 'float64': 'float'}
    mydict = {}
    for col in df.columns:
        if str(df.dtypes[col]) in df_mssql:
            mydict[col] = df_mssql.get(str(df.dtypes[col]))
    # Build a comma-separated "column type" list and strip the trailing comma
    l = " ".join([str(k[0] + " " + k[1] + ",") for k in list(mydict.items())])
    return l[:-1]
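For illustration, on a small made-up pandas DataFrame the function yields a DDL-style column list (column names here are invented):

import pandas as pd

sample = pd.DataFrame({
    "Id": [1, 2],           # int64   -> bigint
    "Product": ["a", "b"],  # object  -> varchar(200)
    "Price": [9.5, 1.0],    # float64 -> float
})
print(convert_dtype(sample))
# Id bigint, Product varchar(200), Price float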
Passing this string to the createTableColumnTypes option solved the situation:
jdbcDF.write \
    .option("createTableColumnTypes", convert_dtype(df)) \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
For this you need to specify the schema in the write statement. Here is the example from the documentation, linked below:
jdbcDF.write \
    .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)") \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
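Adapted to the Azure SQL setup from the question, the same idea would look roughly like this (a sketch that reuses the url, table, and properties variables from the question; the column type string is only an example):

finaldf.write \
    .option("createTableColumnTypes", "Product VARCHAR(200)") \
    .jdbc(url=url, table=table, mode="overwrite", properties=properties)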