PySpark escape backslash and delimiter when reading csv
I am trying to read a CSV in PySpark where my delimiter is "|", but some columns contain "\|" as part of the cell value.
CSV Data:
a|b|c|this should be \| one column
some_df = spark.read.csv(file, sep="|", quote="")
some_df.show()
Output:
+---+---+---+----------------+-----------+
|_c0|_c1|_c2|             _c3|        _c4|
+---+---+---+----------------+-----------+
|  a|  b|  c|this should be \| one column|
+---+---+---+----------------+-----------+
Expected:
+---+---+---+---------------------------+
|_c0|_c1|_c2|                        _c3|
+---+---+---+---------------------------+
|  a|  b|  c|this should be \ one column|
+---+---+---+---------------------------+
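One way to get the expected result without dropping to the RDD API is to read the file as plain text and split only on delimiters that are not preceded by a backslash. The sketch below is a suggestion, not part of the session that follows: the file path is a placeholder, and it relies on pyspark.sql.functions.split accepting a Java regex (here a negative lookbehind), with regexp_replace applied afterwards to turn the escaped "\|" into a plain "\" as in the expected output.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read each line as a single string column named "value".
lines = spark.read.text("/path/to/file.csv")  # placeholder path

# Split on "|" only when it is not preceded by a backslash
# (negative lookbehind in the Java regex used by F.split).
parts = F.split(F.col("value"), r"(?<!\\)\|")

some_df = lines.select(
    parts.getItem(0).alias("_c0"),
    parts.getItem(1).alias("_c1"),
    parts.getItem(2).alias("_c2"),
    # Turn the escaped delimiter "\|" into a plain "\",
    # matching the expected output above.
    F.regexp_replace(parts.getItem(3), r"\\\|", r"\\").alias("_c3"),
)

some_df.show(truncate=False)

The RDD-based session below takes a different route: it strips the escaped delimiter from each raw line before splitting.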
>>> rdd = sc.textFile("/.../file.csv")
>>> rdd.collect()
['a|b|c|this should be \| one column']
>>> rdd1 = rdd.map(lambda x: x.replace("\|", ""))
>>> rdd1.collect()
['a|b|c|this should be one column']
>>> df = rdd1.map(lambda x: (x.split("|"))).map(lambda a : (a[0],a[1],a[2],a[3])).toDF(('col1', 'col2', 'col3', 'col4'))
>>> df.show(10,False)
+----+----+----+--------------------------+
|col1|col2|col3|col4                      |
+----+----+----+--------------------------+
|a   |b   |c   |this should be one column |
+----+----+----+--------------------------+
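For reference, the same idea can be written as a small standalone script instead of a shell session. This is a minimal sketch under the same assumptions (a placeholder path and the four-column layout of the sample line); unlike the session above, it replaces "\|" with "\" rather than deleting it, so the last column comes out as "this should be \ one column", as in the expected output.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile("/path/to/file.csv")  # placeholder path

# Replace the escaped delimiter "\|" with a plain backslash so the
# later split on "|" no longer breaks the last field apart.
cleaned = rdd.map(lambda line: line.replace("\\|", "\\"))

df = (
    cleaned
    .map(lambda line: line.split("|"))
    .map(lambda parts: (parts[0], parts[1], parts[2], parts[3]))
    .toDF(["col1", "col2", "col3", "col4"])
)

df.show(10, False)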