Why are there two options to read a CSV file in PySpark? Which one should I use?

Spark 2.4.4:

I want to import a CSV file, but there are two ways to do it. Why is that? Which one is better, and which should I use?

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .config('spark.cores.max', '3') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.driver.memory','1g') \
    .getOrCreate()

Option 1

df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data/myfile.csv")

Option 2

df = spark.read.load("data/myfile.csv", format="csv", inferSchema="true", header="true")

Since Spark 2, the fully qualified name com.databricks.spark.csv is no longer needed, because a CSV reader is built in. Option 2 is therefore preferred.

Or, slightly shorter:

spark.read.csv("data/myfile.csv", inferSchema=True, header=True)

However, if you extract the input format into a configuration file, option 2 works better, because all reader settings can be passed as plain keyword arguments.

In every language (programming or conversational), there are usually several ways to accomplish the same thing.

Options when reading a CSV file

Spark's CSV data source provides multiple options for working with CSV files; the most common ones are described below.
delimiter

The delimiter option specifies the column delimiter of the CSV file. By default it is the comma (,) character, but it can be set to any character using this option.


val df2 = spark.read.options(Map("delimiter"->","))
  .csv("src/main/resources/zipcodes.csv")

inferSchema

The default value of this option is false. When set to true, Spark automatically infers column types based on the data, which requires reading the data one extra time.


val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->","))
  .csv("src/main/resources/zipcodes.csv")

header

This option is used to read the first line of the CSV file as column names. By default it is false, and (without inferSchema) all column types are assumed to be strings.


val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
  .csv("src/main/resources/zipcodes.csv")

quote

When a column value contains the delimiter character used to split the columns, use the quote option to specify the quote character. By default it is the double quote ("), and delimiters inside quoted values are ignored; using this option you can set any character.
nullValue

Using the nullValue option you can specify a string in the CSV that should be read as null. For example, if a date column uses "1900-01-01" as a sentinel value, this option loads it as null in the DataFrame.
dateFormat

The dateFormat option is used to set the parsing format of input DateType and TimestampType columns. It supports all java.text.SimpleDateFormat patterns.

Note: Besides the options above, the Spark CSV data source supports many other options; refer to the article linked below for details.

Reading CSV files with a user-specified custom schema

If you know the schema of the file ahead of time and do not want to rely on the inferSchema option for column names and types, you can supply user-defined column names and types via the schema option.


    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    val schema = new StructType()
      .add("RecordNumber", IntegerType, true)
      .add("Zipcode", IntegerType, true)
      .add("City", StringType, true)
      .add("State", StringType, true)
      .add("Notes", StringType, true)

    val df_with_schema = spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("src/main/resources/zipcodes.csv")
    df_with_schema.printSchema()
    df_with_schema.show(false)

https://sparkbyexamples.com/spark/spark-read-csv-file-into-dataframe/