PySpark escape backslash and delimiter when reading csv
I am trying to read a CSV in PySpark where my delimiter is "|", but some columns contain "\|" as part of the cell value.
CSV Data:
a|b|c|this should be \| one column
some_df = spark.read.csv(file, sep="|", quote="")
some_df.show()
Output:
+---+---+---+----------------+-----------+
|_c0|_c1|_c2|             _c3|        _c4|
+---+---+---+----------------+-----------+
|  a|  b|  c|this should be \| one column|
+---+---+---+----------------+-----------+
Expected:
+---+---+---+---------------------------+
|_c0|_c1|_c2|                        _c3|
+---+---+---+---------------------------+
|  a|  b|  c|this should be \ one column|
+---+---+---+---------------------------+
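One way to get the expected result without dropping to the RDD API is to read the file as plain text and split only on delimiters that are not preceded by a backslash. The sketch below is a suggestion, not part of the session that follows: the file path is a placeholder, and it relies on pyspark.sql.functions.split accepting a Java regex (here a negative lookbehind), with regexp_replace applied afterwards to turn the escaped "\|" into a plain "\" as in the expected output.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read each line as a single string column named "value".
lines = spark.read.text("/path/to/file.csv")  # placeholder path

# Split on "|" only when it is not preceded by a backslash
# (negative lookbehind in the Java regex used by F.split).
parts = F.split(F.col("value"), r"(?<!\\)\|")

some_df = lines.select(
    parts.getItem(0).alias("_c0"),
    parts.getItem(1).alias("_c1"),
    parts.getItem(2).alias("_c2"),
    # Turn the escaped delimiter "\|" into a plain "\",
    # matching the expected output above.
    F.regexp_replace(parts.getItem(3), r"\\\|", r"\\").alias("_c3"),
)

some_df.show(truncate=False)

The RDD-based session below takes a different route: it strips the escaped delimiter from each raw line before splitting.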
>>> rdd = sc.textFile("/.../file.csv")
>>> rdd.collect()
['a|b|c|this should be \| one column']
>>> rdd1 = rdd.map(lambda x: x.replace("\|", ""))
>>> rdd1.collect()
['a|b|c|this should be one column']
>>> df = rdd1.map(lambda x: (x.split("|"))).map(lambda a : (a[0],a[1],a[2],a[3])).toDF(('col1', 'col2', 'col3', 'col4'))
>>> df.show(10,False)
+----+----+----+--------------------------+
|col1|col2|col3|col4                      |
+----+----+----+--------------------------+
|a   |b   |c   |this should be one column |
+----+----+----+--------------------------+
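For reference, the same idea can be written as a small standalone script instead of a shell session. This is a minimal sketch under the same assumptions (a placeholder path and the four-column layout of the sample line); unlike the session above, it replaces "\|" with "\" rather than deleting it, so the last column comes out as "this should be \ one column", as in the expected output.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("/path/to/file.csv")  # placeholder path

# Replace the escaped delimiter "\|" with a plain backslash so the
# later split on "|" no longer breaks the last field apart.
cleaned = rdd.map(lambda line: line.replace("\\|", "\\"))

df = (
    cleaned
    .map(lambda line: line.split("|"))
    .map(lambda parts: (parts[0], parts[1], parts[2], parts[3]))
    .toDF(["col1", "col2", "col3", "col4"])
)

df.show(10, False)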