How to outer-join two CSV files, using shell script?

I have two CSV files, as shown below:

file1.csv

label,"Part-A"
"ABC mn","2.0"
"XYZ","3.0"
"PQR SN","6"

file2.csv

label,"Part-B"
"XYZ","4.0"
"LMN Wv","8"
"PQR SN","6"
"EFG","1.0"

Desired Output.csv

label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

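Currently, with the awk command below, I can combine the rows whose label appears in both files (such as PQR and XYZ), but I cannot append the rows whose label appears in only one of the files: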

awk -F, 'NR==FNR{a[$1]=substr($0,length($1)+2);next} ($1 in a){print $0","a[$1]}' file1.csv file2.csv

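Since the title of your question is "how to do this in a shell script?" and not necessarily with awk, I'm going to recommend GoCSV, a command-line tool with several sub-commands for processing CSVs (delimited files).

It doesn't have a single command that does what you need, but you can compose several commands to get the correct result.

The core of this solution is the join command, which can perform inner (default), left, right, and outer joins; you want an outer join to keep the non-overlapping elements: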

gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
Joined
+-------+--------+-------+--------+
| label | Part-A | label | Part-B |
+-------+--------+-------+--------+
| ABC   | 2      |       |        |
+-------+--------+-------+--------+
| XYZ   | 3      | XYZ   | 4      |
+-------+--------+-------+--------+
| PQR   | 6      | PQR   | 6      |
+-------+--------+-------+--------+
|       |        | LMN   | 8      |
+-------+--------+-------+--------+
|       |        | EFG   | 1      |
+-------+--------+-------+--------+

The data part is correct, but getting the columns right and getting the NA values in takes a bit more work.

Here's the complete pipeline:

gocsv join -c 'label' -outer file1.csv file2.csv \
| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \
| gocsv select -c 'label','Part-A','Part-B' \
| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \
> final.csv

echo 'Final'
gocsv view final.csv

This gives us the correct final file:

Final
+-------+--------+--------+
| label | Part-A | Part-B |
+-------+--------+--------+
| ABC   | 2      | NA     |
+-------+--------+--------+
| EFG   | NA     | 1      |
+-------+--------+--------+
| LMN   | NA     | 8      |
+-------+--------+--------+
| PQR   | 6      | 6      |
+-------+--------+--------+
| XYZ   | 3      | 4      |
+-------+--------+--------+

There's a lot going on in that pipeline; the highlights are:

Merging the two label fields

| gocsv rename -c 1 -names 'Label_A' \
| gocsv rename -c 3 -names 'Label_B' \
| gocsv add -name 'label' -t '{{ list .Label_A .Label_B | compact | first }}' \

Pare down to just the 3 columns you want

| gocsv select -c 'label','Part-A','Part-B' \

Add the NA values and sort by label

| gocsv replace -c 'Part-A','Part-B' -regex '^$' -repl 'NA' \
| gocsv sort -c 'label' \

I've written up a step-by-step explanation at this Gist.

awk -F, -v OFS=, '{
        if(!o1[$1]) { o1[$1]=$NF; o2[$1]="NA" } else { o2[$1]=$NF }
    } 
    END{
        for(v in o1) { print v, o1[v], o2[v] }
    }' file{1,2}

## output
LMN,8,NA
ABC,2,NA
PQR,6,6
EFG,1,NA
XYZ,3,4

I think this should do fine.
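Note that the one-liner above stores the first value it sees for a label in the Part-A column, so labels found only in the second file (EFG and LMN in its output) end up with their value in the wrong column. A minimal sketch of a variant that records which file each value came from is given here; the function name outer_join and the sample file names are hypothetical, and it assumes a header row in each file and no commas inside fields:

```shell
#!/bin/sh
# Sketch: full outer join of two 2-column CSV files on their first column.
# Assumes a header row in each file and no commas inside fields.
outer_join() {
  # header: "label" plus the value-column name of each file
  printf 'label,%s,%s\n' \
    "$(head -n 1 "$1" | cut -d, -f2)" \
    "$(head -n 1 "$2" | cut -d, -f2)"
  awk -F, -v OFS=, '
    FNR == 1  { file++; next }        # a new file starts: skip its header
    file == 1 { a[$1] = $2 }          # value from the first file
    file == 2 { b[$1] = $2 }          # value from the second file
              { seen[$1] }            # every label from either file
    END {
      for (k in seen)
        print k, (k in a ? a[k] : "NA"), (k in b ? b[k] : "NA")
    }' "$1" "$2" | sort -t, -k1,1     # for-in order is unspecified, so sort
}

# hypothetical sample data (unquoted, as in the output above)
tmp=$(mktemp -d)
printf '%s\n' 'label,Part-A' 'ABC,2' 'XYZ,3' 'PQR,6' > "$tmp/a.csv"
printf '%s\n' 'label,Part-B' 'XYZ,4' 'LMN,8' 'PQR,6' 'EFG,1' > "$tmp/b.csv"
outer_join "$tmp/a.csv" "$tmp/b.csv"
```

With this sample data, EFG and LMN come out as EFG,NA,1 and LMN,NA,8, and the rows are sorted by label.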

I would like to introduce Miller to you. It is a tool that can do a few things with a few file formats and that is available as a stand-alone binary. You just have to download the archive, put the mlr executable somewhere (preferably in your PATH), and you're done with the installation.

mlr --csv \
    join -f file1.csv -j 'label' --ul --ur \
    then \
    unsparsify --fill-with 'NA' \
    then \
    sort -f 'label' \
    file2.csv

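Command parts:

  • mlr --csv
    means that you want to read CSV files and output CSV. As another example, if you wanted to read CSV files and output JSON, it would be mlr --icsv --ojson
  • join -f file1.csv -j 'label' --ul --ur ...... file2.csv
    means to join file1.csv and file2.csv on the field label and to emit the unmatched records of both files
  • then is Miller's way of chaining operations
  • unsparsify --fill-with 'NA'
    means to create the fields that are missing from each record and fill them with NA. It is needed for the records with a unique label
  • then sort -f 'label'
    means to sort the records on the field label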

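Regarding the updated question: mlr handles CSV quoting on its own. The only difference from your new expected output is that it removes the superfluous quotes: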

label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0

This solution prints exactly the desired result with any AWK. Note that the sorting algorithm is taken from the mawk manual.

# SO71053039.awk

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++)
  {
    hold = A_SWAP[j = i]
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","
  out = "Output.csv"

  # read file 1
  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["\"Part-A\""]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["\"Part-B\""]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "\"Part-A\"" OFS "\"Part-B\"" > out

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", LABEL_KEY_SWAP[i])
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS LABEL_KEY1[LABEL_KEY_SWAP[i]] OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? LABEL_KEY2[LABEL_KEY_SWAP[i]] : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS LABEL_KEY2[LABEL_KEY_SWAP[i]])
    print result_string > out
  }
}

Call:

awk -f SO71053039.awk file1.csv file2.csv
=> result file Output.csv with content:
label,"Part-A","Part-B"
"ABC mn","2.0",NA
"EFG",NA,"1.0"
"LMN Wv",NA,"8"
"PQR SN","6","6"
"XYZ","3.0","4.0"

You mentioned join in the comments on my other answer, and I had forgotten about this utility:

#!/bin/sh
rm -f *sorted.csv

# Join two files, normally inner-join only, but
# -  `-a 1 -a 2`:    include "unpaired lines" from file 1 and file 2
# -  `-1 1 -2 1`:    the first column from each is the "join column"
# -  `-o 0,1.2,2.2`: output the "join column" (0) and the second fields from files 1 and 2
# NOTE: join expects both inputs to be sorted on the join column

join -a 1 -a 2 -1 1 -2 1 -o '0,1.2,2.2' -t, file1.csv file2.csv > joined.csv 

# Add NA values
sed -e 's/,,/,NA,/' -e 's/,$/,NA/' joined.csv > unsorted.csv

# Sort, pull out header first
head -n 1 unsorted.csv > sorted.csv

# Then sort remainder
tail -n +2 unsorted.csv | sort -t, -k 1 >> sorted.csv

And here is sorted.csv:

+--------+--------+--------+
| label  | Part-A | Part-B |
+--------+--------+--------+
| ABC mn | 2.0    | NA     |
+--------+--------+--------+
| EFG    | NA     | 1.0    |
+--------+--------+--------+
| LMN Wv | NA     | 8      |
+--------+--------+--------+
| PQR SN | 6      | 6      |
+--------+--------+--------+
| XYZ    | 3.0    | 4.0    |
+--------+--------+--------+
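One caveat: join expects both inputs to be sorted on the join field, and the sample files are not (in file1.csv, "XYZ" comes before "PQR SN"). A minimal sketch that sorts the data rows first and lets join fill in NA itself through its -e option follows; the function name sorted_join and the temporary file names are hypothetical, and it assumes one header row per file and no commas or newlines inside fields:

```shell
#!/bin/sh
# Sketch: outer join with join(1) after sorting the data rows.
# Assumes a header row in each file and no commas or newlines inside fields.
LC_ALL=C; export LC_ALL        # sort and join must agree on collation

sorted_join() {
  left=$(mktemp); right=$(mktemp)
  # strip each header, then sort the data rows on the first field
  tail -n +2 "$1" | sort -t, -k1,1 > "$left"
  tail -n +2 "$2" | sort -t, -k1,1 > "$right"
  # header: first file header, then the value-column name of the second file
  printf '%s,%s\n' "$(head -n 1 "$1")" "$(head -n 1 "$2" | cut -d, -f2)"
  # -a 1 -a 2 keeps unpaired lines; -e NA fills the fields missing from -o
  join -t, -a 1 -a 2 -e NA -o '0,1.2,2.2' "$left" "$right"
  rm -f "$left" "$right"
}

# sample data from the question, written to a scratch directory
dir=$(mktemp -d)
printf '%s\n' 'label,"Part-A"' '"ABC mn","2.0"' '"XYZ","3.0"' '"PQR SN","6"' > "$dir/file1.csv"
printf '%s\n' 'label,"Part-B"' '"XYZ","4.0"' '"LMN Wv","8"' '"PQR SN","6"' '"EFG","1.0"' > "$dir/file2.csv"
sorted_join "$dir/file1.csv" "$dir/file2.csv"
```

Because the rows are sorted before the join, the output is already ordered by label and no separate sed or final sort step is needed.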

Suggesting a gawk script (gawk is the standard awk on Linux):

script.awk

NR == FNR {
  valsStr = sprintf("%s,%s", $2, "na");
  rowsArr[$1] = valsStr;
}
NR != FNR && $1 in rowsArr {
  split(rowsArr[$1], valsArr);
  valsStr = sprintf("%s,%s", valsArr[1], $2);
  rowsArr[$1] = valsStr;
  next;
}
NR != FNR {
  valsStr = sprintf("%s,%s", "na", $2);
  rowsArr[$1] = valsStr;
}
END {
  printf("%s,%s\n", "label", rowsArr["label"]);
  for (rowName in rowsArr) {
     if (rowName == "label") continue;
     printf("%s,%s\n", rowName, rowsArr[rowName]);
  }
}

Output:

awk -F, -f script.awk input.{1,2}.txt

label,Part-A,Part-B
LMN,na,8
ABC,2,na
PQR,6,6
EFG,na,1
XYZ,3,4

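As @Fravadona correctly pointed out in his comment, CSV files that can contain the delimiter, a newline, or double quotes inside a field need a proper CSV parser.

Actually, only two functions are needed: one to unquote CSV fields into plain AWK fields, and one to quote AWK fields when writing the data back out as CSV fields.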
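To illustrate the two directions on single fields, here is a minimal sketch. The names csv_roundtrip, unquote, and quote are hypothetical, the comma separator is hard-coded, and each input line is assumed to hold exactly one raw CSV field with no embedded newline:

```shell
#!/bin/sh
# Sketch: unquote one raw CSV field per input line, then quote it again.
csv_roundtrip() {
  awk '
    # <"foo"> becomes <foo>, and <"foo""bar"> becomes <foo"bar>
    function unquote(f) {
      if (f ~ /^".*"$/) {
        f = substr(f, 2, length(f) - 2)
        gsub(/""/, "\"", f)
      }
      return f
    }
    # add quotes only when the value contains a quote, comma, CR or LF
    function quote(f) {
      if (f ~ /[",\r\n]/) {
        gsub(/"/, "\"\"", f)
        f = "\"" f "\""
      }
      return f
    }
    { v = unquote($0); print v " -> " quote(v) }
  '
}

printf '%s\n' '"PQR SN"' '"a,b"' '"say ""hi"""' | csv_roundtrip
```

The first field comes back without quotes because nothing in it requires them, which is the same reason the superfluous quotes disappear from the result further down.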

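I have written a variant of my earlier answer that uses Ed Morton's CSV parser (the gsub variant, which works with any AWK version) to give an example of proper CSV parsing:

This solution prints the desired result, correctly sorted, with any AWK. Note that the sorting algorithm is taken from the mawk manual.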

# SO71053039_2.awk

# unquote CSV:
# Ed Morton's CSV parser: 
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# quote CSV:
# Quote according to https://datatracker.ietf.org/doc/html/rfc4180 rules
function csvQuote(field, sep) {
  if ((field ~ sep) || (field ~ /["\r\n]/)) {
    gsub(/"/, "\"\"", field)
    field = "\"" field "\""
  }
  return field
}

#-------------------------------------------------
# insertion sort of A[1..n]
function isort( A,A_SWAP,           n,i,j,hold ) {
  n = 0
  for (j in A)
    A_SWAP[++n] = j
  for( i = 2 ; i <= n ; i++)
  {
    hold = A_SWAP[j = i]
    while ( A_SWAP[j-1] "" > "" hold )
    { j-- ; A_SWAP[j+1] = A_SWAP[j] }
    A_SWAP[j] = hold
  }
  # sentinel A_SWAP[0] = "" will be created if needed
  return n
}

BEGIN {
  FS = OFS = ","

  # read file 1
  fnr = 0
  while ((getline < ARGV[1]) > 0) {
    if (! buildRec())
      continue

    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME1[$i] = i # e.g. FIELDBYNAME1["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME1["label"]]
      LABEL_KEY1[$FIELDBYNAME1["label"]] = $FIELDBYNAME1["Part-A"]
    }
  }
  close(ARGV[1])

  # read file2
  fnr = 0
  while ((getline < ARGV[2]) > 0) {
    if (! buildRec())
      continue

    ++fnr
    if (fnr == 1) {
      for (i=1; i<=NF; i++)
        FIELDBYNAME2[$i] = i # e.g. FIELDBYNAME2["label"] = 1
    }
    else {
      LABEL_KEY[$FIELDBYNAME2["label"]]
      LABEL_KEY2[$FIELDBYNAME2["label"]] = $FIELDBYNAME2["Part-B"]
    }
  }
  close(ARGV[2])

  # print the header
  print "label" OFS "Part-A" OFS "Part-B"

  # get the result
  z = isort(LABEL_KEY, LABEL_KEY_SWAP)
  for (i = 1; i <= z; i++) {
    result_string = sprintf("%s", csvQuote(LABEL_KEY_SWAP[i], OFS))
    if (LABEL_KEY_SWAP[i] in LABEL_KEY1)
      result_string = sprintf("%s", result_string OFS csvQuote(LABEL_KEY1[LABEL_KEY_SWAP[i]], OFS) OFS (LABEL_KEY_SWAP[i] in LABEL_KEY2 ? csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS) : "NA"))
    else
      result_string = sprintf("%s", result_string OFS "NA" OFS csvQuote(LABEL_KEY2[LABEL_KEY_SWAP[i]], OFS))
    print result_string
  }
}

Call:

awk -f SO71053039_2.awk file1.csv file2.csv
=> result (superfluous quotes according to CSV rules are omitted):
label,Part-A,Part-B
ABC mn,2.0,NA
EFG,NA,1.0
LMN Wv,NA,8
PQR SN,6,6
XYZ,3.0,4.0