Read a CSV that contains an array of strings in PySpark

Question

I am trying to read a CSV that contains the following data:

name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3

Using inferSchema causes the stops field to spill over into the following columns and scrambles the DataFrame.

If I supply my own schema, for example:

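    from pyspark.sql.types import StructType, StructField, StringType, TimestampType, BooleanType, ArrayType, DoubleType

    schema = StructType([
        StructField('name', StringType()),
        StructField('date', TimestampType()),
        StructField('win', BooleanType()),
        StructField('stops', ArrayType(StringType())),
        StructField('cost', DoubleType())])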

it results in this exception:

pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.

So how can I read the CSV correctly without this failure?

Answer 1

I think this is what you are looking for:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()


dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")

dataframe.printSchema()

Let me know if this helps.

Answer 2

Since the CSV data source does not support arrays, you need to read the column as a string first and then convert it:

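from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Set the escape option to ", because the default escape character is a backslash (\).
df = spark.read.csv('file.csv', header=True, escape='"')

# Parse the JSON-formatted string column into array<string>.
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))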
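
To see why the quote character has to act as the escape, it may help to look at what the stops cell contains once the doubled quotes are collapsed. The following is only a minimal sketch using Python's standard csv and json modules, not Spark itself; the sample rows are copied from the question, and parse_stops is a hypothetical helper introduced for illustration.

```python
import csv
import io
import json

# Header plus two data rows copied from the question.
raw = (
    'name,date,win,stops,cost\n'
    'a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3\n'
    'b,2021-3-1,true,, 1.3\n'
)

# By default Python's csv module reads "" inside a quoted field as one literal quote,
# which is the same convention that escape='"' selects in the Spark reader.
rows = list(csv.DictReader(io.StringIO(raw)))


def parse_stops(cell):
    """Hypothetical helper: decode the JSON text, or return None for an empty cell."""
    return json.loads(cell) if cell else None


stops = [parse_stops(row["stops"]) for row in rows]

print(rows[0]["stops"])  # ["x", "y", "z"]
print(stops)             # [['x', 'y', 'z'], None]
```

Once the cell is unescaped it is plain JSON text, which is why from_json with ArrayType(StringType()) can turn it into an array. Note that the cost values keep their leading space (' 2.3') in this sketch, so that column may also need trimming before being cast to a number.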
