How can I select a string by reading a field name and use it to place it in a different field in bash?

Question

I have a huge text file (thousands of lines) containing blocks of information separated by blank lines (previously filtered). Some examples:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

I want to take the name contained in the "Name" field and add a new field called "Compound ID" after "Formula". Expected output:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

So far I have tried:

#the script needs to receive the name of the file to be formatted
original_file=$1

#negative match to only retain fields of interest and creating a tmp  file
cat $original_file | grep -vE 'MZ|SMILES|TIME|KEY|CCS|MODE|CLASS|Comment' > formatted.txt

#getting Name index
index=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d ":" -f 1))

#getting the name 
name=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d " " -f 2,3,4))

#reading and adding the new field
for i in ${index[*]}; do
    for x in ${name[$i]}; do
        p=4
        pos=`expr $i + $p`
        sed -i "$pos i Compound ID: ${x}" formatted.txt
    done
done

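The inner for loop is giving me trouble: the names contain spaces, so my strategy of matching the line index against the name index does not work.

I don't know whether there is a way to do this in bash or awk.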
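
The word-splitting problem described above can be reproduced in isolation. The sketch below is only an illustration: the variable names and the two-line input are made up, and `mapfile` assumes bash 4 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical two-line input standing in for the output of grep | cut.
names_text=$'CAR 8:1\nAHexCer (O-28:1)18:1;2O/17:0;O'

# An unquoted command substitution inside ( ) is split on every space,
# tab, and newline, so each name is broken into several array elements.
set -f   # disable globbing so the demo shows word splitting only
split_words=($(printf '%s\n' "$names_text"))
set +f

# mapfile -t stores one array element per input line instead.
mapfile -t split_lines <<< "$names_text"

echo "word-split elements: ${#split_words[@]}"
echo "line-split elements: ${#split_lines[@]}"
```

With line-based splitting, each name stays in a single element regardless of the spaces it contains.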

Answer 1

If sed is an option, you can try this:

$ sed '/Name:/{p;s/[^:]*\(.*\)/Compound ID\1/;h;d};/Formula:/{G}' input_file
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

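/Name:/{p - Match the line containing Name: and print it. This leaves a copy of the line in the pattern space to work on.
s/[^:]*\(.*\)/Compound ID\1/;h;d} - On that copy, replace Name with Compound ID while keeping everything from the colon onward, store the result in the hold space, then delete it from the pattern space so it is not printed yet.
/Formula:/{G} - Match the line containing Formula: and append the contents of the hold space after it.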
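
As a minimal sketch of the same hold-space idea, the variant below anchors both patterns with `^` so that only lines that start with the field name match. The function name `add_compound_id_sed` and the temporary sample file are made up for this illustration. The commands are split across `-e` options, one per option, to avoid relying on how a particular sed parses `d}` on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical sample block, written to a temporary file for the demo.
sample=$(mktemp)
printf '%s\n' \
  'Name: AHexCer (O-28:1)18:1;2O/17:0;O' \
  'Precursor type: [M+H]+' \
  'Formula: C69H131NO10' \
  'Num Peaks: 10' \
  '239.2375 150' > "$sample"

# Print the Name line, turn the copy into a "Compound ID:" line, park it
# in the hold space, and append it right after the Formula line.
add_compound_id_sed() {
  sed -e '/^Name:/{' \
      -e 'p' \
      -e 's/^Name:/Compound ID:/' \
      -e 'h' \
      -e 'd' \
      -e '}' \
      -e '/^Formula:/G' "$1"
}

add_compound_id_sed "$sample"
```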

Answer 2

Assumptions:

all 'chunks' contain a Name: line and a Formula: line

One awk idea:

awk '
{ print $0                                       # print current line
  if ($1 == "Name:")
     compid=gensub(/Name:/,"Compound ID:",1)     # save current line, replacing "Name:" with "Compound ID:" (gensub() requires GNU awk)
  if ($1 == "Formula:")
     print compid                                # print "Compound ID:" line
}
' input.txt

This generates:

Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
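
gensub() is a GNU awk extension, so the command above may fail on mawk or BSD awk. As a rough sketch under the same assumptions, the same result can be obtained with the standard sub() function applied to a copy of the line. The function name `add_compound_id_awk` and the sample file are invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical two-block sample, written to a temporary file for the demo.
sample=$(mktemp)
printf '%s\n' \
  'Name: CAR 8:1' \
  'Precursor type: [M]+' \
  'Formula: C15H28NO4' \
  'Num Peaks: 2' \
  '85.02841 800' \
  '' \
  'Name: AHexCer (O-28:1)18:1;2O/17:0;O' \
  'Precursor type: [M+H]+' \
  'Formula: C69H131NO10' \
  'Num Peaks: 10' > "$sample"

add_compound_id_awk() {
  awk '
    { print }                                  # echo every input line
    $1 == "Name:" {                            # remember a rewritten copy
      compid = $0
      sub(/^Name:/, "Compound ID:", compid)    # edits the copy, not $0
    }
    $1 == "Formula:" { print compid }          # emit it after Formula:
  ' "$1"
}

add_compound_id_awk "$sample"
```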

Answer 3

awk '{print;if($1=="Name:"){$1="";n="Compound ID:"$0};if($1=="Formula:")print n}' file.txt
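
One thing to be aware of with this one-liner: assigning to `$1` makes awk rebuild the record with single spaces between fields, so a name containing consecutive spaces would be altered. The sketch below compares that behaviour with copying the text after the 5-character `Name:` prefix using substr(). The input line is a made-up example, not taken from the real file.

```shell
#!/usr/bin/env bash
line='Name: CAR  8:1'   # hypothetical name containing two consecutive spaces

# Clearing $1 makes awk rebuild the record with single-space separators.
rebuilt=$(printf '%s\n' "$line" | awk '{ $1 = ""; print "Compound ID:" $0 }')

# substr() copies everything after the "Name:" prefix untouched.
verbatim=$(printf '%s\n' "$line" | awk '{ print "Compound ID:" substr($0, 6) }')

printf '%s\n' "$rebuilt" "$verbatim"
```

For the sample data in the question both forms give the same result, since the names there contain only single spaces.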

Answer 4

There is also awk's paragraph mode:

awk -v RS= -v FS='\n' -v OFS='\n' -v ORS='\n\n' '
        {s=substr($1, 7)}
        $3 ~ /^Formula:/ {$3 = $3 "\nCompound ID: " s} 1' file
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999

Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100

Instead of passing RS, FS, OFS, and ORS separately with -v, we can set them in a BEGIN block:

awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}
    {s=substr($1, 7)}
    $3 ~ /^Formula:/ {$3 = $3 "\nCompound ID: " s} 1' file
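
The paragraph-mode commands above rely on Name: being the first line and Formula: the third line of every block. If that cannot be guaranteed, a possible sketch is to search the fields of each block instead of using fixed positions. The function name `add_compound_id_para` and the sample data, where the first block has no Precursor type line, are assumptions made for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample: Formula: is line 2 in the first block, line 3 in the second.
sample=$(mktemp)
printf '%s\n' \
  'Name: CAR 8:1' \
  'Formula: C15H28NO4' \
  'Num Peaks: 2' \
  '' \
  'Name: AHexC (O-8:1)18:1;2O/17:0;O' \
  'Precursor type: [M+H]+' \
  'Formula: C69H131NO10' \
  'Num Peaks: 10' > "$sample"

add_compound_id_para() {
  awk 'BEGIN { RS = ""; ORS = "\n\n"; FS = OFS = "\n" }
  {
    name = ""
    # Each field is one line of the block; find the name wherever it is.
    for (i = 1; i <= NF; i++)
      if ($i ~ /^Name: /) name = substr($i, 7)
    # Append the new line to the first Formula: field, if a name was found.
    for (i = 1; i <= NF; i++)
      if ($i ~ /^Formula:/ && name != "") {
        $i = $i "\nCompound ID: " name
        break
      }
    print
  }' "$1"
}

add_compound_id_para "$sample"
```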
