How to read a CSV file with additional commas inside quotes using pyspark?

Question

I'm having some trouble reading the following CSV data, which is encoded in UTF-16:

FullName, FullLabel, Type
TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,

As far as I understand, this should not be a problem for the reader, since there is a quote parameter to handle it.

df = spark.read.csv(file_path, header=True, encoding='UTF-16', quote = '"')

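However, this still gives me an incorrect split: the comma inside the quoted FullLabel value is treated as a field separator.

Is there some way to handle these cases, or do I need to work around it with RDDs?

Thanks in advance.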

Answer 1

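You can read the file as plain text with spark.read.text and split each line with a regex that splits on commas but ignores those inside quotes (see this post), then pick the columns out of the resulting array: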

from pyspark.sql import functions as F

df = spark.read.text(file_path)

df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
    "value",
    F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
).select(
    F.col("value")[0].alias("FullName"),
    F.col("value")[1].alias("FullLabel"),
    F.col("value")[2].alias("Type")
)

df.show(truncate=False)

#+----------+--------------------------------------------------+-----+
#|FullName  |FullLabel                                         |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+
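To see what the lookahead in that pattern does without starting Spark, here is a minimal sketch using Python's re module on the sample row from the question. The names SPLIT_OUTSIDE_QUOTES, line and parts are made up for illustration. The pattern itself is the one from the answer; it is assumed here that it behaves the same way under Python's regex engine as under the Java engine that Spark's split uses.

```python
import re

# Same pattern as in the Spark snippet: split on a comma only when an even
# number of double quotes follows it, i.e. when the comma is outside quotes.
SPLIT_OUTSIDE_QUOTES = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

# Sample row copied from the question (note the trailing comma).
line = 'TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,'

parts = re.split(SPLIT_OUTSIDE_QUOTES, line)

# Print each piece with repr() so leading spaces and quotes stay visible.
for index, part in enumerate(parts):
    print(index, repr(part))
```

The comma between "Spar Verde" and "Fonte Verde" is followed by a single quote, so the lookahead fails there and no split happens. The trailing comma produces an empty fourth element, which the select in the answer simply never reads.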

Update:

For input files in UTF-16, you can replace spark.read.text by loading the file with binaryFiles and then converting the resulting RDD into a DataFrame:

df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])
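The decoding step inside that lambda can be tried on its own. The sketch below is a hypothetical helper, decode_utf16_lines, which is not part of the answer. It differs from the lambda in two labeled ways: it uses splitlines() instead of split("\n") so that Windows line endings leave no trailing "\r", and it drops empty lines. The sample bytes are built in memory as a stand-in for a real file.

```python
def decode_utf16_lines(raw_bytes):
    """Decode UTF-16 bytes and return one single-element row per non-empty line.

    Hypothetical helper mirroring the lambda in the answer; unlike the lambda,
    it uses splitlines() and skips empty lines (both are assumptions).
    """
    text = raw_bytes.decode("utf-16")  # the BOM written by encode() is consumed here
    return [[line] for line in text.splitlines() if line]


# In-memory stand-in for the file contents, with Windows line endings.
sample = (
    'FullName, FullLabel, Type\r\n'
    'TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,\r\n'
)
rows = decode_utf16_lines(sample.encode("utf-16"))
print(rows)
```

If this variant suits your data, it could presumably be passed to flatMap in place of the inline lambda, for example flatMap(lambda x: decode_utf16_lines(x[1])).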

Answer 2

Another option is as follows (if you find it simpler):

First, read the text file as an RDD, replace ":" with ~:~, and save the result as a text file.

sc.textFile(file_path).map(lambda x: x.replace('":"','~:~')).saveAsTextFile(tempPath)

Next, read the temporary path, this time as a DataFrame, and replace ~:~ with ":" again.

from pyspark.sql import functions as F
spark.read.option('header','true').csv(tempPath).withColumn('FullLabel',F.regexp_replace(F.col('FullLabel'),'~:~','":"')).show(1, False)

+----------+-----------------------------------------------+----+
|FullName  |FullLabel                                      |Type|
+----------+-----------------------------------------------+----+
|TEST.slice|Consideration":"Verde (Spar Verde, Fonte Verde)|Test|
+----------+-----------------------------------------------+----+
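The idea behind the placeholder can be checked without Spark. The following is only a sketch under stated assumptions: Python's csv module stands in for Spark's CSV reader (the two parsers are not identical), skipinitialspace=True is used to cope with the spaces after the commas in the question's sample, and the names line, masked, row and restored are made up for illustration.

```python
import csv

# Sample row copied from the question (note the trailing comma).
line = 'TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,'

# Step 1: mask the inner ":" so the field has exactly one opening and one
# closing quote, which makes it an ordinary quoted CSV field.
masked = line.replace('":"', '~:~')

# Step 2: parse the masked line. Python's csv module is only a stand-in for
# Spark's reader; skipinitialspace drops the space after each comma.
row = next(csv.reader([masked], skipinitialspace=True))

# Step 3: restore the placeholder in the first three fields (the trailing
# comma in the sample yields an extra empty fourth field, ignored here).
restored = [field.replace('~:~', '":"') for field in row[:3]]
print(restored)
```

As in the output above, the outer quotes are consumed by the parser while the inner ":" is put back afterwards. The placeholder ~:~ must not occur anywhere in the real data, otherwise the restore step would corrupt it.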

python

apache-spark

apache-spark-sql

pyspark

pyspark-dataframes