
PySpark: explain the difference between reading a CSV with and without a custom schema

I am reading a CSV file that has a header, but I am supplying a custom schema for the read. I want to understand whether the explain output differs depending on whether or not I provide a schema. I am curious about this statement in the documentation for read.csv:

Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.

Compared with using inferSchema, I can see a time difference at the prompt when I provide a schema, but I cannot see any difference in the explain output. Below are my code and the output when providing a schema:

>> friends_header_df = spark.read.csv(path='resources/fakefriends-header.csv', schema=custom_schema, header='true', sep=',')
>> print(friends_header_df._jdf.queryExecution().toString())
== Parsed Logical Plan ==
Relation[id#8,name#9,age#10,numFriends#11] csv

== Analyzed Logical Plan ==
id: int, name: string, age: int, numFriends: int
Relation[id#8,name#9,age#10,numFriends#11] csv

== Optimized Logical Plan ==
Relation[id#8,name#9,age#10,numFriends#11] csv

== Physical Plan ==
FileScan csv [id#8,name#9,age#10,numFriends#11] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,age:int,numFriends:int>

And below is the same read using the inferSchema option:

>> friends_noschema_df = spark.read.csv(path='resources/fakefriends-header.csv',header='true',inferSchema='true',sep=',')
>> print(friends_noschema_df._jdf.queryExecution().toString())
== Parsed Logical Plan ==
Relation[userID#32,name#33,age#34,friends#35] csv

== Analyzed Logical Plan ==
userID: int, name: string, age: int, friends: int
Relation[userID#32,name#33,age#34,friends#35] csv

== Optimized Logical Plan ==
Relation[userID#32,name#33,age#34,friends#35] csv

== Physical Plan ==
FileScan csv [userID#32,name#33,age#34,friends#35] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<userID:int,name:string,age:int,friends:int>

Apart from the column ID numbers changing in the Parsed Logical Plan, I do not see anything in the explain output indicating that Spark reads all the data once.

inferSchema=false is the default option, and with it you get all columns as strings in the DataFrame. If you provide a schema instead, you get the types you specified.

Inferring the schema means Spark launches an extra job under the hood to do it; in fact, you can see that job in the Spark UI. This takes longer but, as you noted, it does not show up anywhere in the explain plan, because the inference job runs when the file is read to determine the schema, before the query plan you are inspecting is ever executed.