Read a CSV that contains an array of strings in PySpark
I'm trying to read a CSV that contains the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema causes the stops field to spill over into the following columns and messes up the DataFrame.
If I supply my own schema, such as:
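from pyspark.sql.types import (ArrayType, BooleanType, DoubleType, StringType,
                               StructField, StructType, TimestampType)

schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])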
it results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how do I read the CSV properly without this failure?
I think this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
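Let me know if it helps.
Since the CSV data source does not support arrays, you need to read the column as a string first and then convert it.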
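from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Set the escape option to ", because the default escape character is a backslash (\).
df = spark.read.csv('file.csv', header=True, escape='"')
# Parse the JSON-style string in stops into an array<string> column.
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))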
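To check what the stops column should contain once the doubled quotes are treated as escapes, the same two steps (read as a string, then parse as JSON) can be sketched in plain Python. This is only an illustration using the standard library's csv and json modules, not Spark; the helper name parse_rows is hypothetical, and the sample rows are copied from the question.

```python
import csv
import io
import json

# Sample rows copied from the question.
RAW = '''name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
'''


def parse_rows(text):
    """Read every field as a string, then turn stops into a list (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Python's csv module treats "" inside a quoted field as one literal quote,
        # so stops arrives here as a JSON array string such as ["x", "y", "z"].
        # An empty field means the row has no array.
        row["stops"] = json.loads(row["stops"]) if row["stops"] else None
        rows.append(row)
    return rows


rows = parse_rows(RAW)
for row in rows:
    print(row["name"], row["stops"])
```

The cost values keep their leading space in this sketch, so they would need stripping before conversion to a number.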