How can I select a string by reading a field name and place it in a different field in bash?
I have a huge text file (thousands of lines) containing blocks of information separated by blank lines (already filtered). Some examples:
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
I want to take the name contained in the "Name" field and add a new field called "Compound ID" right after "Formula".
Expected output:
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
What I have tried so far:
# the script needs to receive the name of the file to be formatted
original_file=$1
#negative match to only retain fields of interest and creating a tmp file
cat $original_file | grep -vE 'MZ|SMILES|TIME|KEY|CCS|MODE|CLASS|Comment' > formatted.txt
#getting Name index
index=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d ":" -f 1))
#getting the name
name=($(grep -nE 'Name.+' MSDIAL-TandemMassSpectralAtlas-VS69-Pos_PQI.msp | cut -d " " -f 2,3,4))
#reading and adding the new field
for i in ${index[*]}; do
for x in ${name[$i]}; do
p=4
pos=`expr $i + $p`
sed -i "$pos i Compound ID: ${x}" formatted.txt
done
done
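The inner for loop is where I run into trouble: the names contain spaces, so they are split into several array elements, and my strategy of matching the line index to the name index does not work.
I don't know whether there is a way to do this in bash or awk.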
If sed is an option, you can try this:
$ sed '/Name:/{p;s/[^:]*\(.*\)/Compound ID\1/;h;d};/Formula:/{G}' input_file
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
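/Name:/{p - Match the line containing Name: and print it. This creates a duplicate line.
s/[^:]*\(.*\)/Compound ID\1/;h;d} - Transform the duplicate: replace the text before the first colon (Name) with Compound ID and keep everything else, copy the result to the hold space, then delete it from the pattern space.
/Formula:/{G} - Match the line containing Formula: and append the contents of the hold space.
Assumptions:
- Every chunk contains a Name: line and a Formula: line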
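Some sed implementations may not accept a closing brace directly after a command, and /Name:/ matches the text anywhere in a line. As a minimal sketch of a more portable spelling, assuming every field starts at the beginning of its line (the function name add_compound_id_sed is made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: same idea as the one-liner, one sed command per line.
add_compound_id_sed() {
    sed '
/^Name:/{
p
s/^Name:/Compound ID:/
h
d
}
/^Formula:/G
' "$@"
}

# Demo on a two-block sample read from standard input.
add_compound_id_sed <<'EOF'
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2

Name: AHexC (O-8:1)18:1;2O/17:0;O
Formula: C69H131NO10
EOF
```

The function passes its arguments to sed, so it can also be called with a file name, for example add_compound_id_sed formatted.txt.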
An awk idea:
awk '
{ print $0                                       # print current line
  if ($1 == "Name:")
     compid=gensub(/Name:/,"Compound ID:",1)     # save current line, replacing "Name:" with "Compound ID:"
  if ($1 == "Formula:")
     print compid                                # print "Compound ID:" line
}
' input.txt
This generates:
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
awk '{print;if($1=="Name:"){$1="";n="Compound ID:"$0};if($1=="Formula:")print n}' file.txt
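Two details of the awk versions above may matter on some systems: gensub is a GNU awk extension, and assigning to a field makes awk rebuild the record, which collapses runs of spaces inside the name. A minimal sketch that avoids both, assuming only POSIX awk features (the function name add_compound_id is hypothetical); it also forgets the saved name at each blank line, so a block without a Name: line does not inherit the previous one:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every line, and emit a "Compound ID:" line
# right after each "Formula:" line, built from the last "Name:" line seen.
add_compound_id() {
    awk '
        { print }                                        # always print the current line
        /^Name:/    { compid = $0                        # copy the line, then relabel the copy
                      sub(/^Name:/, "Compound ID:", compid) }
        /^Formula:/ && compid != "" { print compid }     # insert the saved line
        /^[ \t]*$/  { compid = "" }                      # blank line: a new block starts
    ' "$@"
}

# Demo: the name with a double space keeps its spacing.
add_compound_id <<'EOF'
Name: CAR  8:1
Formula: C15H28NO4
Num Peaks: 2
EOF
```

Because sub() works on a copy of the line held in a variable, the original record is never modified and its spacing is preserved.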
Also, awk in paragraph mode:
awk -v RS= -v FS='\n' -v OFS='\n' -v ORS='\n\n' '
{s=substr($1, 7)}
$3 ~ /^Formula:/ { $3 = $3 "\nCompound ID: " s} 1' file
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Compound ID: CAR 8:1
Num Peaks: 2
85.02841 800
286.2013 999
Name: AHexCer (O-28:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexCer (O-28:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
Name: AHexC (O-8:1)18:1;2O/17:0;O
Precursor type: [M+H]+
Formula: C69H131NO10
Compound ID: AHexC (O-8:1)18:1;2O/17:0;O
Num Peaks: 10
239.2375 150
252.2691 50
264.2691 200
282.2797 100
We can also set RS, FS, OFS and ORS in a BEGIN block instead of passing them separately:
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}
{s=substr($1, 7)}
$3 ~ /^Formula:/ { $3 = $3 "\nCompound ID: " s} 1' file
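The paragraph-mode commands above expect Name: on the first line of each block and Formula: on the third. If that order is not guaranteed, the fields can be searched for instead. A minimal sketch under that assumption (the function name add_compound_id_paragraph is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: paragraph mode, locating Name: and Formula: by content
# rather than by position within the block.
add_compound_id_paragraph() {
    awk 'BEGIN { RS = ""; FS = OFS = "\n"; ORS = "\n\n" }
    {
        name = ""
        for (i = 1; i <= NF; i++)                 # find the name, wherever it is
            if ($i ~ /^Name:/) {
                name = $i
                sub(/^Name:[ \t]*/, "", name)
            }
        for (i = 1; i <= NF; i++)                 # append the new line to the Formula field
            if ($i ~ /^Formula:/ && name != "") {
                $i = $i OFS "Compound ID: " name
                break
            }
        print
    }' "$@"
}

# Demo: Formula: is the third line in one block and the second in the other.
add_compound_id_paragraph <<'EOF'
Name: CAR 8:1
Precursor type: [M]+
Formula: C15H28NO4
Num Peaks: 2

Name: AHexC (O-8:1)18:1;2O/17:0;O
Formula: C69H131NO10
Num Peaks: 10
EOF
```

With ORS set to two newlines, every block, including the last one, is followed by a blank line in the output.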