How can I select a string by reading a field name and place it in a different field in bash?

I have a huge text file (thousands of lines) containing blocks of information separated by blank lines (previously filtered). A few examples:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

I want to take the name contained in the "Name" field and add a new field called "Compound ID" after "Formula". Expected output:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

What I have tried so far:

#the script needs to receive the name of the file to be formatted
original_file=

#negative match to retain only the fields of interest, creating a tmp file
cat $original_file | grep -vE 'MZ|SMILES|TIME|KEY|CCS|MODE|CLASS|Comment' > formatted.txt

#getting Name index
index=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d ":" -f 1))

#getting the name 
name=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d " " -f 2,3,4))

#reading and adding the new field
for i in ${index[*]}; do
    for x in ${name[$i]}; do
        p=4
        pos=`expr $i + $p`
        sed -i "$pos i Compound ID: ${x}" formatted.txt
    done
done

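The inner for loop is giving me problems: the names contain spaces, so they get split into separate array elements and my strategy of matching the index number to the name index does not work.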

I don't know whether there is a way to do this in bash or awk.

If sed is an option, you can try this:

$ sed '/Name:/{p;s/[^:]*\(.*\)/Compound ID\1/;h;d};/Formula:/{G}' input_file
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
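
  • /Name:/{p - match the Name: line and print it. This creates a duplicate of the line.

  • s/[^:]*\(.*\)/Compound ID\1/;h;d} - working on the duplicate still in the pattern space, replace Name with Compound ID while keeping everything from the colon onward, copy the result to the hold space, then delete it from the pattern space.

  • /Formula:/{G} - match the Formula: line and append the contents of the hold space.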
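
The three steps above can also be written as a commented, multi-line sed script, which may be easier to follow and avoids relying on `;` before `}`. This is only a sketch: the sample file is a shortened copy of the first block from the question, and the `^` anchors are an added assumption that the labels always start the line.

```shell
#!/usr/bin/env bash
# Hypothetical sample file, shortened from the question data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
EOF

result=$(sed '
/^Name:/{
    # print the Name: line unchanged
    p
    # rewrite the copy in the pattern space as "Compound ID: ..."
    s/^[^:]*\(.*\)/Compound ID\1/
    # save it in the hold space, then drop it from the output
    h
    d
}
/^Formula:/{
    # append the hold space after the Formula: line
    G
}
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The hold space keeps its contents until the next Name: line overwrites it, so a block without a Name: line would reuse the previous block's ID.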

Assumptions:

  • all 'chunks' contain a Name: line and a Formula: line

One awk idea:

awk '
{ print $0                                       # print current line
  if ($1 == "Name:")
     compid=gensub(/Name:/,"Compound ID:",1)     # save current line, replacing "Name:" with "Compound ID:" (gensub() is GNU awk)
  if ($1 == "Formula:")
     print compid                                # print "Compound ID:" line
}
' input.txt

This generates:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
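
Since gensub() is a GNU awk extension, here is a sketch of the same idea using only sub() on a copy of the line, which should also work in other awks. As an extra, it clears the saved ID at every blank line, so a chunk that breaks the assumption above (no Name: line) does not inherit the previous chunk's name. The sample data is made up for the illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample: the second chunk deliberately has no Name: line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Name: CAR 8:1
Formula: C15H28NO4
Num Peaks: 2

Formula: C69H131NO10
Num Peaks: 10
EOF

out=$(awk '
NF == 0       { compid = "" }                  # blank line: new chunk, forget the old name
$1 == "Name:" { compid = $0                    # copy the line ...
                sub(/^Name:/, "Compound ID:", compid) }   # ... and relabel the copy
              { print }                        # always print the current line
$1 == "Formula:" && compid != "" { print compid }          # add the ID only if one was seen
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Whether a chunk without a name should be skipped silently, as here, or reported is a design choice that depends on the data.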

awk '{print;if($1=="Name:"){$1="";n="Compound ID:"$0};if($1=="Formula:")print n}' file.txt

There is also awk's paragraph mode:

awk -v RS= -v FS='\n' -v OFS='\n' -v ORS='\n\n' '
        {s=substr($1, 7)}
        $3 ~ /^Formula:/ { $3 = $3 "\nCompound ID: " s} 1' file
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

  • We can set RS, FS, OFS and ORS in a BEGIN block instead of passing them separately:
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}
    {s=substr($1, 7)}
    $3 ~ /^Formula:/ { $3 = $3 "\nCompound ID: " s} 1' file
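
The paragraph-mode commands above rely on Name: being the first line and Formula: the third line of every block. If that is not guaranteed, a variation is to look both lines up by their label. The following is only a sketch under the assumption that Name: still comes before Formula: within a block; the sample data is invented, with the lines reordered on purpose.

```shell
#!/usr/bin/env bash
# Hypothetical sample: Formula: is on line 2 here instead of line 3.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Name: CAR 8:1
Formula: C15H28NO4
Precursor type: [M]+
Num Peaks: 2

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
EOF

out=$(awk 'BEGIN { RS = ""; FS = OFS = "\n"; ORS = "\n\n" }
{
    id = ""                                    # reset for every block
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^Name: /)
            id = substr($i, 7)                 # text after "Name: "
        else if ($i ~ /^Formula:/ && id != "")
            $i = $i "\nCompound ID: " id       # append the new line to this field
    }
    print
}' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Assigning to `$i` makes awk rebuild the record with OFS, so the extra line is written out right after the Formula: line wherever it occurs.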