如何使用 shell 脚本外部连接两个 CSV 文件?
How to outer-join two CSV files, using shell script?
我有两个 CSV 文件,如下所示:
file1.csv
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
file2.csv
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
想要Output.csv
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
目前,使用下面的 awk 命令,我能够合并在 PQR 和 XYZ 等文件中具有标签条目的匹配项,但无法附加在两个文件中都没有标签值的文件:
awk -F, 'NR==FNR{a[]=substr([=14=],length()+2);next} ( in a){print [=14=]","a[]}' file1.csv file2.csv
由于您的问题的标题是“如何在 shell 脚本中进行操作?”不一定使用 awk,我将推荐 GoCSV,一个 command-line 工具,其中包含多个 sub-commands 用于处理 CSV(分隔文件)。
它没有一个命令可以完成您需要的,但您可以组合多个命令来获得正确的结果。
这个解决方案的核心是join命令,它可以执行内连接(默认)、左连接、右连接和外连接;你想要一个 outer join 来保留 non-overlapping 元素:
gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC | 2 | | |
+-------+--------+-------+--------+
| XYZ | 3 | XYZ | 4 |
+-------+--------+-------+--------+
| PQR | 6 | PQR | 6 |
+-------+--------+-------+--------+
| | | LMN | 8 |
+-------+--------+-------+--------+
| | | EFG | 1 |
+-------+--------+-------+--------+
data-part 是正确的,但要使列正确并在其中获取 NA
值需要一些工作。
这是一个完整的管道:
gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv
echo 'Final'
gocsv view final.csv
这为我们提供了正确的最终文件:
Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC | 2 | NA |
+-------+--------+--------+
| EFG | NA | 1 |
+-------+--------+--------+
| LMN | NA | 8 |
+-------+--------+--------+
| PQR | 6 | 6 |
+-------+--------+--------+
| XYZ | 3 | 4 |
+-------+--------+--------+
该管道中发生了很多事情,重点是:
合并两个标签字段
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
Pare-down 到您想要的 3 列
| gocsv select -c 'label','Part-A','Part-B' \
添加 NA 值并按标签排序
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
我在 this Gist 做了一个 step-by-step 的解释。
awk -v OFS=, '{
if(!o1[]) { o1[]=$NF; o2[]="NA" } else { o2[]=$NF }
}
END{
for(v in o1) { print v, o1[v], o2[v] }
}' file{1,2}
## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4
我认为这会很好。
我想介绍一下 Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive,将 mlr
可执行文件放在某个地方(最好在您的 PATH 中),然后您就完成了安装。
mlr --csv \
join -f file1.csv -j 'label' --ul --ur \
then \
unsparsify --fill-with 'NA' \
then \
sort -f 'label' \
file2.csv
命令部分:
mlr --csv
表示您要读取 CSV 文件并输出 CSV 格式。作为另一个例子,如果你想读取 CSV 文件并输出 JSON 格式,它将是 mlr --icsv --ojson
join -f file1.csv -j 'label' --ul --ur
...... file2.csv
表示在字段 label
上加入 file1.csv
和 file2.csv
并发出两个文件的不匹配记录
then
是 Miller 的链式操作方式
unsparsify --fill-with 'NA'
意思是创建每个文件中不存在的字段,并用NA
填充它们。具有 uniq 标签的记录需要它
then
sort -f 'label'
表示对label
字段上的记录进行排序
关于更新的问题: mlr
自行处理 CSV 引用。与您的新预期输出的唯一区别是它删除了多余的引号:
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0
此解决方案使用任何 AWK 都可以准确地打印出预期的结果。
请注意,排序算法取自 mawk 手册。
# SO71053039.awk
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
out = "Output.csv"
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
}
}
close(ARGV[2])
# print the header
print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", LABEL_KEY_SWAP[i])
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
print result_string > out
}
}
通话:
awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
你在我的另一个回答的评论中提到了 join,我忘记了这个实用程序:
#!/bin/sh
rm -f *sorted.csv
# Join two files, normally inner-join only, but
# - `-a 1 -a 2`: include "unpaired lines" from file 1 and file 2
# - `-1 1 -2 1`: the first column from each is the "join column"
# - `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv
# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv
# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv
# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv
还有,这里是 sorted.csv
+--------+--------+--------+
| label | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0 | NA |
+--------+--------+--------+
| EFG | NA | 1.0 |
+--------+--------+--------+
| LMN Wv | NA | 8 |
+--------+--------+--------+
| PQR SN | 6 | 6 |
+--------+--------+--------+
| XYZ | 3.0 | 4.0 |
+--------+--------+--------+
我们建议 gawk
标准脚本 Linux awk
:
script.awk
NR == FNR {
valsStr = sprintf("%s,%s", , "na");
rowsArr[] = valsStr;
}
NR != FNR && in rowsArr {
split(rowsArr[],valsArr);
valsStr = sprintf("%s,%s", valsArr[1], );
rowsArr[] = valsStr;
next;
}
NR != FNR {
valsStr = sprintf("%s,%s", "na", );
rowsArr[] = valsStr;
}
END {
printf("%s,%s\n", "label", rowsArr["label"]);
for (rowName in rowsArr) {
if (rowName == "label") continue;
printf("%s,%s\n", rowName, rowsArr[rowName]);
}
}
输出:
awk -F, -f script.awk input.{1,2}.txt
label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4
正如@Fravadona 在他的评论中正确指出的那样,对于可以包含分隔符、换行符或字段内双引号的 CSV 文件,需要 适当的 CSV 解析器。
实际上,只需要两个函数:一个用于取消引用 CSV 字段到普通 AWK 字段,一个用于引用 AWK 字段到将数据写回 CSV 字段。
我已经写了一个我以前的答案的变体() that uses Ed Morton's CSV parser ( gsub 变体适用于任何 AWK 版本)来给出一个正确的 CSV 解析的例子:
此解决方案打印使用任何 AWK 正确排序的期望结果。
请注意,排序算法取自 mawk 手册。
# SO71053039_2.awk
# unquote CSV:
# Ed Morton's CSV parser:
function buildRec( fpat,fldNr,fldStr,done) {
CurrRec = CurrRec [=10=]
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\" for \-separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
[=10=] = ""
fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
gsub(/""/, "\"", fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
if ((field ~ sep) || (field ~ /["\r\n]/)) {
gsub(/"/, "\"\"", field)
field = "\"" field "\""
}
return field
}
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
}
}
close(ARGV[2])
# print the header
print "label" OFS "Part-A" OFS "Part-B"
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
print result_string
}
}
致电:
awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0
我有两个 CSV 文件,如下所示:
file1.csv
label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"
file2.csv
label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"
想要Output.csv
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
目前,使用下面的 awk 命令,我能够合并在 PQR 和 XYZ 等文件中具有标签条目的匹配项,但无法附加在两个文件中都没有标签值的文件:
awk -F, 'NR==FNR{a[]=substr([=14=],length()+2);next} ( in a){print [=14=]","a[]}' file1.csv file2.csv
由于您的问题的标题是“如何在 shell 脚本中进行操作?”不一定使用 awk,我将推荐 GoCSV,一个 command-line 工具,其中包含多个 sub-commands 用于处理 CSV(分隔文件)。
它没有一个命令可以完成您需要的,但您可以组合多个命令来获得正确的结果。
这个解决方案的核心是join命令,它可以执行内连接(默认)、左连接、右连接和外连接;你想要一个 outer join 来保留 non-overlapping 元素:
gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC | 2 | | |
+-------+--------+-------+--------+
| XYZ | 3 | XYZ | 4 |
+-------+--------+-------+--------+
| PQR | 6 | PQR | 6 |
+-------+--------+-------+--------+
| | | LMN | 8 |
+-------+--------+-------+--------+
| | | EFG | 1 |
+-------+--------+-------+--------+
data-part 是正确的,但要使列正确并在其中获取 NA
值需要一些工作。
这是一个完整的管道:
gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv
echo 'Final'
gocsv view final.csv
这为我们提供了正确的最终文件:
Final pipeline
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC | 2 | NA |
+-------+--------+--------+
| EFG | NA | 1 |
+-------+--------+--------+
| LMN | NA | 8 |
+-------+--------+--------+
| PQR | 6 | 6 |
+-------+--------+--------+
| XYZ | 3 | 4 |
+-------+--------+--------+
该管道中发生了很多事情,重点是:
合并两个标签字段
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
Pare-down 到您想要的 3 列
| gocsv select -c 'label','Part-A','Part-B' \
添加 NA 值并按标签排序
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
我在 this Gist 做了一个 step-by-step 的解释。
awk -v OFS=, '{
if(!o1[]) { o1[]=$NF; o2[]="NA" } else { o2[]=$NF }
}
END{
for(v in o1) { print v, o1[v], o2[v] }
}' file{1,2}
## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4
我认为这会很好。
我想介绍一下 Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive,将 mlr
可执行文件放在某个地方(最好在您的 PATH 中),然后您就完成了安装。
mlr --csv \
join -f file1.csv -j 'label' --ul --ur \
then \
unsparsify --fill-with 'NA' \
then \
sort -f 'label' \
file2.csv
命令部分:
mlr --csv
表示您要读取 CSV 文件并输出 CSV 格式。作为另一个例子,如果你想读取 CSV 文件并输出 JSON 格式,它将是mlr --icsv --ojson
join -f file1.csv -j 'label' --ul --ur
......file2.csv
表示在字段label
上加入file1.csv
和file2.csv
并发出两个文件的不匹配记录then
是 Miller 的链式操作方式unsparsify --fill-with 'NA'
意思是创建每个文件中不存在的字段,并用NA
填充它们。具有 uniq 标签的记录需要它then
sort -f 'label'
表示对label
字段上的记录进行排序
关于更新的问题: mlr
自行处理 CSV 引用。与您的新预期输出的唯一区别是它删除了多余的引号:
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0
此解决方案使用任何 AWK 都可以准确地打印出预期的结果。 请注意,排序算法取自 mawk 手册。
# SO71053039.awk
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
out = "Output.csv"
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
}
}
close(ARGV[2])
# print the header
print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", LABEL_KEY_SWAP[i])
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
print result_string > out
}
}
通话:
awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"
你在我的另一个回答的评论中提到了 join,我忘记了这个实用程序:
#!/bin/sh
rm -f *sorted.csv
# Join two files, normally inner-join only, but
# - `-a 1 -a 2`: include "unpaired lines" from file 1 and file 2
# - `-1 1 -2 1`: the first column from each is the "join column"
# - `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv
# Add NA values
cat joined.csv | sed 's/,,/,NA,/' | sed 's/,$/,NA/' > unsorted.csv
# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv
# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv
还有,这里是 sorted.csv
+--------+--------+--------+
| label | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0 | NA |
+--------+--------+--------+
| EFG | NA | 1.0 |
+--------+--------+--------+
| LMN Wv | NA | 8 |
+--------+--------+--------+
| PQR SN | 6 | 6 |
+--------+--------+--------+
| XYZ | 3.0 | 4.0 |
+--------+--------+--------+
我们建议 gawk
标准脚本 Linux awk
:
script.awk
NR == FNR {
valsStr = sprintf("%s,%s", , "na");
rowsArr[] = valsStr;
}
NR != FNR && in rowsArr {
split(rowsArr[],valsArr);
valsStr = sprintf("%s,%s", valsArr[1], );
rowsArr[] = valsStr;
next;
}
NR != FNR {
valsStr = sprintf("%s,%s", "na", );
rowsArr[] = valsStr;
}
END {
printf("%s,%s\n", "label", rowsArr["label"]);
for (rowName in rowsArr) {
if (rowName == "label") continue;
printf("%s,%s\n", rowName, rowsArr[rowName]);
}
}
输出:
awk -F, -f script.awk input.{1,2}.txt
label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4
正如@Fravadona 在他的评论中正确指出的那样,对于可以包含分隔符、换行符或字段内双引号的 CSV 文件,需要 适当的 CSV 解析器。
实际上,只需要两个函数:一个用于取消引用 CSV 字段到普通 AWK 字段,一个用于引用 AWK 字段到将数据写回 CSV 字段。
我已经写了一个我以前的答案的变体(
此解决方案打印使用任何 AWK 正确排序的期望结果。 请注意,排序算法取自 mawk 手册。
# SO71053039_2.awk
# unquote CSV:
# Ed Morton's CSV parser:
function buildRec( fpat,fldNr,fldStr,done) {
CurrRec = CurrRec [=10=]
if ( gsub(/"/,"&",CurrRec) % 2 ) {
# The string built so far in CurrRec has an odd number
# of "s and so is not yet a complete record.
CurrRec = CurrRec RS
done = 0
}
else {
# If CurrRec ended with a null field we would exit the
# loop below before handling it so ensure that cannot happen.
# We use a regexp comparison using a bracket expression here
# and in fpat so it will work even if FS is a regexp metachar
# or a multi-char string like "\\" for \-separated fields.
CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
[=10=] = ""
fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
while ( (CurrRec != "") && match(CurrRec,fpat) ) {
fldStr = substr(CurrRec,RSTART,RLENGTH)
# Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
gsub(/""/, "\"", fldStr)
}
$(++fldNr) = fldStr
CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
}
CurrRec = ""
done = 1
}
return done
}
# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
if ((field ~ sep) || (field ~ /["\r\n]/)) {
gsub(/"/, "\"\"", field)
field = "\"" field "\""
}
return field
}
#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP, n,i,j,hold ) {
n = 0
for (j in A)
A_SWAP[++n] = j
for( i = 2 ; i <= n ; i++)
{
hold = A_SWAP[j = i]
while ( A_SWAP[j-1] "" > "" hold )
{ j-- ; A_SWAP[j+1] = A_SWAP[j] }
A_SWAP[j] = hold
}
# sentinel A_SWAP[0] = "" will be created if needed
return n
}
BEGIN {
FS = OFS = ","
# read file 1
fnr = 0
while ((getline < ARGV[1]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME1["label"]]
LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
}
}
close(ARGV[1])
# read file2
fnr = 0
while ((getline < ARGV[2]) > 0) {
if (! buildRec())
continue
++fnr
if (fnr == 1) {
for (i=1; i<=NF; i++)
FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
}
else {
LABEL_KEY[$FIELDBYNAME2["label"]]
LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
}
}
close(ARGV[2])
# print the header
print "label" OFS "Part-A" OFS "Part-B"
# get the result
z = isort(LABEL_KEY, LABEL_KEY_SWAP)
for (i = 1; i <= z; i++) {
result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
else
result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
print result_string
}
}
致电:
awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0