Pad Independently Missing Columns per Row in CSV with Bash (based on expected values)

I have a CSV file in which the ideal format of a row looks like this:

taxID#,scientific name,kingdom,k,phylum,p,class,c,order,o,family,f,genus,g

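...where kingdom, phylum, etc. are identifiers, i.e. the literal strings ("kingdom", ... "phylum"), and the values following the identifiers (k, p, etc.) are the actual values of those kingdoms, phyla, etc.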

Example:

240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa

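However, not all rows have every level of the taxonomy. Any row may be missing the columns for an identifier/value pair, for example "class,c,", and any 2-column pair can be missing independently of whether other pairs are missing. Also, if a value field is missing, its identifier field is always missing with it, so I will never have "kingdom,phylum" next to each other without a value for k between them. As a result, most of my file is missing random fields: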

...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...

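Question: How can I write a bash shell script that "pads" each row in the file, so that every field pair from my ideal format that is missing gets inserted, with the value column after it simply left blank? Desired output: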

...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...

Notes:

What I've tried:

I'd like to be able to run my script like this:

bash pad.sh prePadding.csv postPadding.csv

But if need be, I'll accept answers that use Excel for Mac 2011.

Thanks!!

While this should be possible in bash, I would use Perl for it. I tried to keep the code as simple and readable as possible.

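#!/usr/bin/perl
use strict;
use warnings;

# Read the input line by line (files named on the command line, or stdin).
while (<>){
    chomp;
    # Fields 0 and 1 are the ID and the name; the rest are identifier/value pairs.
    my @fields=split ',';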
    my $kingdom='';
    my $phylum='';
    my $class='';
    my $order='';
    my $family='';
    my $genus='';
    # Pairs start at index 2; store each value that is present under its rank.
    for (my $i=2;$i<$#fields;$i+=2){
        if ($fields[$i] eq 'kingdom'){$kingdom=$fields[$i+1];}
        if ($fields[$i] eq 'phylum'){$phylum=$fields[$i+1];}
        if ($fields[$i] eq 'class'){$class=$fields[$i+1];}
        if ($fields[$i] eq 'order'){$order=$fields[$i+1];}
        if ($fields[$i] eq 'family'){$family=$fields[$i+1];}
        if ($fields[$i] eq 'genus'){$genus=$fields[$i+1];}
    }
    print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}

This gives me:

perl pad.pl  input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,

(or, for better readability:)

perl pad.pl  input  | tableize -t | sed 's/^/    /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica          |kingdom|       |phylum|              |class|Actinobacteria|order|Corynebacteriales|family|       |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus                      |kingdom|Metazoa|phylum|Chordata      |class|Mammalia      |order|Rodentia         |family|Muridae|genus|Mus     |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium        |kingdom|       |phylum|Actinobacteria|class|Actinobacteria|order|                 |family|       |genus|        |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom|       |phylum|Acidobacteria |class|              |order|                 |family|       |genus|        |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+

This would be the bash answer, using associative arrays:

#!/bin/bash

declare -A THIS
while IFS=, read -r -a LINE; do
  # we always get the #ID and name
  if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
    echo Invalid CSV line: "${LINE[@]}" >&2
    continue
  fi
  echo -n "${LINE[0]},${LINE[1]},"
  THIS=()
  for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
    THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
  done
  for KEY in kingdom phylum class order family; do
    echo -n "$KEY,${THIS[$KEY]},"
  done
  echo "genus,${THIS[genus]}"
done < "$1" > "$2"

It also validates the CSV rows: each must contain at least 2 columns (ID and name) and have an even number of columns.
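To see which rows that rule accepts and rejects, the check can be pulled out into a small helper. This is only a sketch: the function name `is_valid_row` is hypothetical and is not part of the script above.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when a CSV line has at least 2 fields
# and an even number of fields (ID, name, then identifier/value pairs).
is_valid_row() {
  local -a fields
  IFS=, read -r -a fields <<< "$1"
  (( ${#fields[@]} >= 2 && ${#fields[@]} % 2 == 0 ))
}

# A row with only ID and name passes; a dangling identifier does not.
is_valid_row '77133,uncultured bacterium' && echo "ok"
is_valid_row '171953,uncultured Acidobacteria bacterium,phylum' || echo "rejected"
```

The first call prints `ok` and the second prints `rejected`, since three fields is an odd count.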

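The script could be extended with more error checking (e.g. whether two arguments were passed, whether the input exists, etc.), but it should work as expected for the data you posted.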
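As a rough illustration of those extra checks, the arguments could be verified before anything is read. This is a sketch under assumptions: the function name `check_args` and the error messages are made up for illustration, and the same-file test is an extra precaution that was not mentioned above.

```shell
#!/bin/bash

# Hypothetical pre-flight check for: bash pad.sh prePadding.csv postPadding.csv
# Prints a message to stderr and returns non-zero if the arguments are unusable.
check_args() {
  if (( $# != 2 )); then
    echo "Usage: pad.sh INPUT.csv OUTPUT.csv" >&2
    return 1
  fi
  if [[ ! -r $1 ]]; then
    echo "Cannot read input file: $1" >&2
    return 1
  fi
  # Redirecting output onto the input file would truncate it before it is read.
  if [[ $1 -ef $2 ]]; then
    echo "Input and output must be different files" >&2
    return 1
  fi
}
```

In the script it would be called as `check_args "$@" || exit 1` just before the `while` loop.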