Pyspark Not Picking Up Custom Schema

Question

I am testing this code.

from pyspark.sql.functions import input_file_name
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)


customSchema = StructType([
    StructField("id", StringType(), True),
    StructField("date", StringType(), True),
    # etc., etc., etc.
    StructField("filename", StringType(), True)])



fullPath = "path_and_credentials_here"
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', schema = customSchema, delimiter='|').load(fullPath).withColumn("filename",input_file_name())

df.show()

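Now, my data is pipe-delimited, and the first row contains some metadata, which is also pipe-delimited. The odd thing is that the custom schema is ignored. Instead of my custom schema being applied, the metadata in the first line of the file controls the schema, which is completely wrong. This is what I see.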

+------------------+----------+------------+---------+--------------------+
|               _c0|       _c1|         _c2|      _c3|            filename|
+------------------+----------+------------+---------+--------------------+
|                CP|  20190628|    22:41:58|   001586|   abfss://rawdat...|
|          asset_id|price_date|price_source|bid_value|   abfss://rawdat...|
|             2e58f|  20190628|         CPN|  108.375|   abfss://rawdat...|
|             2e58f|  20190628|         FNR|     null|   abfss://rawdat...|

etc., etc., etc.

How can I apply the custom schema?

Answer 1

The problem you are having is because you are using the older (and no longer maintained) CSV reader. See the disclaimer right below the title of the package.

If you try the newer format, it works:

In [33]: !cat /tmp/data.csv
CP|12|12:13
a|b|c
10|12|13

In [34]: spark.read.csv(fullPath, header='false', schema = customSchema, sep='|').show()
+----+---+-----+
|name|foo|  bar|
+----+---+-----+
|  CP| 12|12:13|
|   a|  b|    c|
|  10| 12|   13|
+----+---+-----+
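
For reference, here is a minimal sketch of the same read written out in full. It rests on some assumptions: the three-column schema (`name`, `foo`, `bar`) is inferred from the output above, and the helper names `write_sample` and `read_with_schema` are hypothetical, used only for illustration. The sketch passes the schema through `DataFrameReader.schema(...)` rather than as a key inside `options(...)`, and re-adds the `filename` column from the question.

```python
def write_sample(path):
    """Write the three pipe-delimited lines shown in /tmp/data.csv above."""
    with open(path, "w") as f:
        f.write("CP|12|12:13\na|b|c\n10|12|13\n")


def read_with_schema(spark, path):
    """Read a pipe-delimited file with an explicit schema (hypothetical helper).

    `spark` is an existing SparkSession; `path` points at the file to read.
    """
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql.functions import input_file_name
    from pyspark.sql.types import StringType, StructField, StructType

    # Assumed column names, taken from the output above; a real schema
    # would list the fields from the question instead.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("foo", StringType(), True),
        StructField("bar", StringType(), True),
    ])

    return (
        spark.read
        .schema(schema)                      # explicit schema, no inference
        .csv(path, header=False, sep="|")    # built-in CSV reader
        .withColumn("filename", input_file_name())
    )
```

Given a local session, for example `spark = SparkSession.builder.master("local[1]").getOrCreate()`, calling `write_sample("/tmp/data.csv")` and then `read_with_schema(spark, "/tmp/data.csv").show()` should show the columns `name`, `foo`, `bar`, and `filename` instead of `_c0`, `_c1`, and so on.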


Tags: python, dataframe, pyspark, pyspark-sql, pyspark-dataframes