Parse multi-line CSV using PySpark, Python or Shell

Question

Input (2 columns):

col1 , col2
David, 100
"Ronald
Sr, Ron , Ram" , 200
Harry
potter
jr" , 200
Prof.
Snape" , 100

Note: Harry and Prof. do not have an opening quote.

Output (2 columns):

col1 | col2
David | 100
Ronald Sr , Ron , Ram| 200
Harry potter jr| 200 
Prof. Snape| 100

What have I tried (PySpark)?

df = spark.read.format("csv").option("header",True).option("multiLine",True).option("escape","\'")

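Problem: the code above works fine when a multi-line field has both an opening and a closing double quote (e.g. the row starting with Ronald).

It does not work for rows that have only a closing quote and no opening quote (such as Harry and Prof.).

Even just adding the missing opening quotes for Harry and Prof. would solve the problem.

Any ideas using PySpark, Python, Shell, etc. are welcome!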
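
The observation that adding the missing opening quotes would be enough can be sketched as a small pre-processing pass in plain Python. This is only an illustration under assumptions taken from the sample: the first line is a header, and a record ends on a line whose text after the last comma is a number. The helper names (`looks_complete`, `add_missing_opening_quotes`, `parse_repaired`) are hypothetical and not part of any library.

```python
import csv
import io

SAMPLE = '''col1 , col2
David, 100
"Ronald
Sr, Ron , Ram" , 200
Harry
potter
jr" , 200
Prof.
Snape" , 100
'''

def looks_complete(line):
    """Assumption: a record ends on a line whose last comma-separated piece is a number."""
    _, sep, tail = line.rpartition(",")
    if not sep:
        return False
    try:
        float(tail.strip())
        return True
    except ValueError:
        return False

def add_missing_opening_quotes(text):
    """Prepend a double quote to the first line of any multi-line record that lacks one."""
    repaired = []
    at_record_start = False
    for number, line in enumerate(text.splitlines()):
        if number == 0:
            # Assumption: the first line is the header and is always complete.
            repaired.append(line)
            at_record_start = True
            continue
        if not line.strip():
            # Blank lines are passed through without changing the state.
            repaired.append(line)
            continue
        complete = looks_complete(line)
        if at_record_start and not complete and not line.lstrip().startswith('"'):
            line = '"' + line
        repaired.append(line)
        at_record_start = complete
    return "\n".join(repaired) + "\n"

def parse_repaired(text):
    """Parse the repaired text with the csv module and collapse whitespace in each field."""
    reader = csv.reader(io.StringIO(add_missing_opening_quotes(text)))
    return [[" ".join(field.split()) for field in row] for row in reader if row]

for row in parse_repaired(SAMPLE):
    print(" | ".join(row))
```

The repaired text is ordinary quoted CSV, so in principle it could also be written back to a file and read with the multiLine option shown above, although that path is not tested here.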

Answer 1

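Based solely on the small sample provided:

remove all double quotes
there are two comma-delimited fields; the first field is a string, the second field is a number
the first field may contain commas and may be split across multiple lines
replace the comma delimiter with a pipe (|)
OP's expected output has inconsistent spacing before the newly inserted pipe (|): sometimes a space is removed, sometimes one is inserted; for now we won't worry about spacing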

One awk idea (requires GNU awk, since gensub() is a gawk extension):

awk -F, '
             { gsub(/"/,"") }                      # remove double quotes
FNR==1 ||                                          # if 1st line or last field is a number then ...
($NF+0)==$NF { print prev gensub(FS,"|",(NF-1))    # print any previous line(s) data plus current line, replacing last comma with a pipe
               prev=""                             # clear previous line(s) data
               next                                # skip to next line of input
             }
             { prev= prev $0 " " }                 # if we get here then this is a broken line so save contents for later printing
' sample.csv

This generates:

col1 | col2
David| 100
Ronald Sr, Ron , Ram | 200
Harry potter jr | 200
Prof. Snape | 100
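
For readers who prefer Python, the same idea can be sketched roughly as follows. It is a loose port of the awk logic rather than an exact equivalent: it makes the same assumptions (strip all double quotes, treat the first non-blank line as the header, treat a line as the end of a record when the text after its last comma is a number), and it additionally collapses runs of whitespace, which the awk script does not do. The function names `is_number` and `merge_broken_rows` are hypothetical.

```python
SAMPLE = '''col1 , col2
David, 100
"Ronald
Sr, Ron , Ram" , 200
Harry
potter
jr" , 200
Prof.
Snape" , 100
'''

def is_number(text):
    """Return True when text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def merge_broken_rows(lines):
    """Merge physical lines into (col1, col2) tuples, one per logical record."""
    rows = []
    pending = []          # pieces of a record that is still incomplete
    header_done = False
    for raw in lines:
        line = raw.rstrip("\n").replace('"', "")   # remove double quotes
        if not line.strip():
            continue                               # ignore blank lines
        head, sep, tail = line.rpartition(",")     # split on the last comma
        if sep and (not header_done or is_number(tail.strip())):
            # Record is complete: join saved pieces with the current one.
            first = " ".join(" ".join(pending + [head]).split())
            rows.append((first, tail.strip()))
            pending = []
            header_done = True
        else:
            # Broken line: save its contents for the next complete line.
            pending.append(line)
    return rows

for col1, col2 in merge_broken_rows(SAMPLE.splitlines()):
    print(f"{col1} | {col2}")
```

Because whitespace is normalized, the spacing around the pipe is uniform here, which sidesteps the spacing inconsistency noted above.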
