Extract data from an SDF file in a bash environment
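I want to extract data from an SDF file.
I want to save the > <Name> and > <SCORE.INTER> values in a .tsv file.
Is there a quick way to do this, for example with awk?
Thanks in advance.
The SDF file consists of thousands of blocks. One block of the file looks like this: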
ZINC000169748276
38 39 0 0 0 0 0 0 0 0999 V2000
11.2318 3.6419 22.3134 C 0 0 0 0 0 0
12.5621 3.7685 22.2617 C 0 0 0 0 0 0
13.0725 5.1806 22.3121 C 0 0 0 0 0 0
10.8850 6.0303 22.4462 C 0 0 0 0 0 0
13.4310 2.6268 22.1614 C 0 0 0 0 0 0
12.9848 1.3691 22.0592 C 0 0 0 0 0 0
8.2548 4.7608 21.1375 C 0 0 0 0 0 0
7.1479 3.7322 21.1132 C 0 0 0 0 0 0
7.7728 2.5366 21.8185 C 0 0 0 0 0 0
8.9539 4.4605 22.4534 C 0 0 0 0 0 0
13.8873 0.1824 21.9500 C 0 0 0 0 0 0
8.5117 1.6060 20.8656 C 0 0 0 0 0 0
12.2544 6.2009 22.3970 N 0 0 0 0 0 0
10.3635 4.7178 22.4055 N 0 0 0 0 0 0
14.4254 5.4429 22.2718 N 0 0 0 0 0 0
13.7646 -0.5167 20.6443 N 0 3 0 0 0 0
6.5529 -4.6019 19.9460 O 0 5 0 0 0 0
8.2203 -4.0310 21.8048 O 0 5 0 0 0 0
6.8149 1.6459 17.3793 O 0 5 0 0 0 0
5.4231 -2.1179 18.5726 O 0 5 0 0 0 0
10.1403 7.0090 22.5243 O 0 0 0 0 0 0
5.7155 -3.6365 22.1679 O 0 0 0 0 0 0
5.6431 1.8811 19.7228 O 0 0 0 0 0 0
5.0295 -0.6218 20.7059 O 0 0 0 0 0 0
8.7342 3.0736 22.7475 O 0 0 0 0 0 0
6.0324 4.2091 21.8626 O 0 0 0 0 0 0
8.1857 1.9631 19.5323 O 0 0 0 0 0 0
7.0232 -2.2197 20.5667 O 0 0 0 0 0 0
7.0081 -0.1966 19.1450 O 0 0 0 0 0 0
6.8632 -3.7464 21.1697 P 0 0 0 0 0 0
6.7991 1.4009 18.8725 P 0 0 0 0 0 0
5.9605 -1.3044 19.7288 P 0 0 0 0 0 0
15.0444 4.6730 22.2089 H 0 0 0 0 0 0
14.7148 6.3890 22.3078 H 0 0 0 0 0 0
14.3405 -1.3642 20.6292 H 0 0 0 0 0 0
14.0706 0.0896 19.8769 H 0 0 0 0 0 0
12.7928 -0.7891 20.4667 H 0 0 0 0 0 0
5.3352 3.5319 21.8055 H 0 0 0 0 0 0
1 2 2 0 0 0
1 14 1 0 0 0
2 3 1 0 0 0
2 5 1 0 0 0
3 13 2 0 0 0
3 15 1 0 0 0
4 13 1 0 0 0
4 14 1 0 0 0
4 21 2 0 0 0
5 6 2 0 0 0
6 11 1 0 0 0
7 8 1 0 0 0
7 10 1 0 0 0
8 9 1 0 0 0
8 26 1 0 0 0
9 12 1 0 0 0
9 25 1 0 0 0
10 14 1 0 0 0
10 25 1 0 0 0
11 16 1 0 0 0
12 27 1 0 0 0
17 30 1 0 0 0
18 30 1 0 0 0
19 31 1 0 0 0
20 32 1 0 0 0
22 30 2 0 0 0
23 31 2 0 0 0
24 32 2 0 0 0
27 31 1 0 0 0
28 30 1 0 0 0
28 32 1 0 0 0
29 31 1 0 0 0
29 32 1 0 0 0
15 33 1 0 0 0
15 34 1 0 0 0
16 35 1 0 0 0
16 36 1 0 0 0
16 37 1 0 0 0
26 38 1 0 0 0
M END
> <CHROM.1>
2.74804207,-114.83879868,178.63419806,-11.86097681,-104.18799792,-175.61867989
-82.60305529,-167.43897154,58.52671946,-50.63759561,-111.24083331,101.74294800
8.69431853,1.29062552,20.98254072,-0.89039136,0.27787279,-3.08051579
> <Name>
ZINC000169748276
> <RI>
1.76083e+07
> <Rbt.Executable>
rbdock/0.1.0
> <Rbt.Library>
librxdock.so/0.1.0
> <SCORE>
-41.7582
> <SCORE.INTER>
-41.8551
> <SCORE.INTER.CONST>
1
> <SCORE.INTER.POLAR>
-4.96496
> <SCORE.INTER.REPUL>
0
> <SCORE.INTER.ROT>
10
> <SCORE.INTER.VDW>
-40.3742
> <SCORE.INTER.norm>
-1.30797
> <SCORE.INTRA>
0.0969082
> <SCORE.INTRA.DIHEDRAL>
-5.79141
> <SCORE.INTRA.DIHEDRAL.0>
19.5819
> <SCORE.INTRA.POLAR>
0
> <SCORE.INTRA.POLAR.0>
0
> <SCORE.INTRA.REPUL>
0
> <SCORE.INTRA.REPUL.0>
0
> <SCORE.INTRA.VDW>
2.99261
> <SCORE.INTRA.VDW.0>
-5.2787
> <SCORE.INTRA.norm>
0.00302838
> <SCORE.RESTR>
0
> <SCORE.RESTR.CAVITY>
0
> <SCORE.RESTR.norm>
0
> <SCORE.SYSTEM>
0
> <SCORE.SYSTEM.CONST>
0
> <SCORE.SYSTEM.DIHEDRAL>
0
> <SCORE.SYSTEM.norm>
0
> <SCORE.heavy>
32
> <SCORE.norm>
-1.30494
$$$$
The .tsv file should look like this:
ZINC000169748276 -41.8551
ZINC000079214514 -41.7892
ZINC000195993528 -40.9293
Why awk?
Prompt> grep -A 1 -i "<NAME>" test.txt | tail -n 1
ZINC000169748276
Prompt> grep -A 1 -i "<SCORE.INTER>" test.txt | tail -n 1
-41.8551
As you can see, grep is much simpler. -A 1 means "also print the 1 line that follows each match".
After some discussion, this is the final solution:
grep -A 1 -i "<SCORE.INTER>" test.sdf | grep -v '^>' | grep -v '^--' >> results
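The final grep command above collects only the scores. As a rough sketch of how the same idea could produce both columns, the two tags can be matched in one pass and the surviving value lines joined in pairs with paste. This assumes every record contains exactly one > <Name> and one > <SCORE.INTER>, in that order; the function name sdf_to_tsv_grep and the small sample file are made up for illustration.

```shell
# Sketch: pair each <Name> value with its <SCORE.INTER> value using grep and paste.
# Assumption: every record has exactly one of each tag, with <Name> first.
sdf_to_tsv_grep() {
    # -F treats the patterns as fixed strings, so "." is not a wildcard and
    # "<SCORE.INTER>" cannot match "<SCORE.INTER.CONST>".
    grep -F -A 1 -e '<Name>' -e '<SCORE.INTER>' "$1" |
        grep -v -e '^>' -e '^--$' |   # drop the tag lines and grep group separators
        paste - -                     # join every two lines with a TAB
}

# Tiny two-record sample (molblocks omitted), only for demonstration.
sample_grep=$(mktemp)
printf '%s\n' \
    '> <Name>' 'ZINC000169748276' '> <SCORE.INTER>' '-41.8551' \
    '> <SCORE.INTER.CONST>' '1' '$$$$' \
    '> <Name>' 'ZINC000079214514' '> <SCORE.INTER>' '-41.7892' \
    '> <SCORE.INTER.CONST>' '1' '$$$$' > "$sample_grep"

sdf_to_tsv_grep "$sample_grep"
```

If a record lacks one of the two tags, the pairing shifts for all following lines, which is one reason the awk answers keep track of record boundaries instead.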
"I want to save the > <NAME> and > <SCORE.INTER> values in a .tsv file. Is there any way for a quick solution e.g. via awk?"
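Your file has > <Name>, not > <NAME> (an important difference if you match case-sensitively). I would use GNU AWK for this task as follows (assuming that > <Name> always precedes > <SCORE.INTER> and that every > <SCORE.INTER> has a corresponding > <Name>). Let file.txt content be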
ZINC000169748276
38 39 0 0 0 0 0 0 0 0999 V2000
11.2318 3.6419 22.3134 C 0 0 0 0 0 0
12.5621 3.7685 22.2617 C 0 0 0 0 0 0
13.0725 5.1806 22.3121 C 0 0 0 0 0 0
10.8850 6.0303 22.4462 C 0 0 0 0 0 0
13.4310 2.6268 22.1614 C 0 0 0 0 0 0
12.9848 1.3691 22.0592 C 0 0 0 0 0 0
8.2548 4.7608 21.1375 C 0 0 0 0 0 0
7.1479 3.7322 21.1132 C 0 0 0 0 0 0
7.7728 2.5366 21.8185 C 0 0 0 0 0 0
8.9539 4.4605 22.4534 C 0 0 0 0 0 0
13.8873 0.1824 21.9500 C 0 0 0 0 0 0
8.5117 1.6060 20.8656 C 0 0 0 0 0 0
12.2544 6.2009 22.3970 N 0 0 0 0 0 0
10.3635 4.7178 22.4055 N 0 0 0 0 0 0
14.4254 5.4429 22.2718 N 0 0 0 0 0 0
13.7646 -0.5167 20.6443 N 0 3 0 0 0 0
6.5529 -4.6019 19.9460 O 0 5 0 0 0 0
8.2203 -4.0310 21.8048 O 0 5 0 0 0 0
6.8149 1.6459 17.3793 O 0 5 0 0 0 0
5.4231 -2.1179 18.5726 O 0 5 0 0 0 0
10.1403 7.0090 22.5243 O 0 0 0 0 0 0
5.7155 -3.6365 22.1679 O 0 0 0 0 0 0
5.6431 1.8811 19.7228 O 0 0 0 0 0 0
5.0295 -0.6218 20.7059 O 0 0 0 0 0 0
8.7342 3.0736 22.7475 O 0 0 0 0 0 0
6.0324 4.2091 21.8626 O 0 0 0 0 0 0
8.1857 1.9631 19.5323 O 0 0 0 0 0 0
7.0232 -2.2197 20.5667 O 0 0 0 0 0 0
7.0081 -0.1966 19.1450 O 0 0 0 0 0 0
6.8632 -3.7464 21.1697 P 0 0 0 0 0 0
6.7991 1.4009 18.8725 P 0 0 0 0 0 0
5.9605 -1.3044 19.7288 P 0 0 0 0 0 0
15.0444 4.6730 22.2089 H 0 0 0 0 0 0
14.7148 6.3890 22.3078 H 0 0 0 0 0 0
14.3405 -1.3642 20.6292 H 0 0 0 0 0 0
14.0706 0.0896 19.8769 H 0 0 0 0 0 0
12.7928 -0.7891 20.4667 H 0 0 0 0 0 0
5.3352 3.5319 21.8055 H 0 0 0 0 0 0
1 2 2 0 0 0
1 14 1 0 0 0
2 3 1 0 0 0
2 5 1 0 0 0
3 13 2 0 0 0
3 15 1 0 0 0
4 13 1 0 0 0
4 14 1 0 0 0
4 21 2 0 0 0
5 6 2 0 0 0
6 11 1 0 0 0
7 8 1 0 0 0
7 10 1 0 0 0
8 9 1 0 0 0
8 26 1 0 0 0
9 12 1 0 0 0
9 25 1 0 0 0
10 14 1 0 0 0
10 25 1 0 0 0
11 16 1 0 0 0
12 27 1 0 0 0
17 30 1 0 0 0
18 30 1 0 0 0
19 31 1 0 0 0
20 32 1 0 0 0
22 30 2 0 0 0
23 31 2 0 0 0
24 32 2 0 0 0
27 31 1 0 0 0
28 30 1 0 0 0
28 32 1 0 0 0
29 31 1 0 0 0
29 32 1 0 0 0
15 33 1 0 0 0
15 34 1 0 0 0
16 35 1 0 0 0
16 36 1 0 0 0
16 37 1 0 0 0
26 38 1 0 0 0
M END
> <CHROM.1>
2.74804207,-114.83879868,178.63419806,-11.86097681,-104.18799792,-175.61867989
-82.60305529,-167.43897154,58.52671946,-50.63759561,-111.24083331,101.74294800
8.69431853,1.29062552,20.98254072,-0.89039136,0.27787279,-3.08051579
> <Name>
ZINC000169748276
> <RI>
1.76083e+07
> <Rbt.Executable>
rbdock/0.1.0
> <Rbt.Library>
librxdock.so/0.1.0
> <SCORE>
-41.7582
> <SCORE.INTER>
-41.8551
> <SCORE.INTER.CONST>
1
> <SCORE.INTER.POLAR>
-4.96496
> <SCORE.INTER.REPUL>
0
> <SCORE.INTER.ROT>
10
> <SCORE.INTER.VDW>
-40.3742
> <SCORE.INTER.norm>
-1.30797
> <SCORE.INTRA>
0.0969082
> <SCORE.INTRA.DIHEDRAL>
-5.79141
> <SCORE.INTRA.DIHEDRAL.0>
19.5819
> <SCORE.INTRA.POLAR>
0
> <SCORE.INTRA.POLAR.0>
0
> <SCORE.INTRA.REPUL>
0
> <SCORE.INTRA.REPUL.0>
0
> <SCORE.INTRA.VDW>
2.99261
> <SCORE.INTRA.VDW.0>
-5.2787
> <SCORE.INTRA.norm>
0.00302838
> <SCORE.RESTR>
0
> <SCORE.RESTR.CAVITY>
0
> <SCORE.RESTR.norm>
0
> <SCORE.SYSTEM>
0
> <SCORE.SYSTEM.CONST>
0
> <SCORE.SYSTEM.DIHEDRAL>
0
> <SCORE.SYSTEM.norm>
0
> <SCORE.heavy>
32
> <SCORE.norm>
-1.30494
$$$$
then
awk '/^> <Name>/{getline;printf "%s\t",$0}/^> <SCORE\.INTER>/{getline;print $0}' file.txt
output
ZINC000169748276 -41.8551
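Explanation: getline makes GNU AWK load the next line, so $0 becomes the content of the line after the current one. When a line starting (^) with > <Name> is encountered, load the next line and print it followed by a TAB; for a line starting with > <SCORE.INTER>, load the next line and print it. Note that . has a special meaning in a regular expression and needs to be escaped.
(tested in gawk 4.2.1)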
Using any awk:
$ awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    NF { f[tag]=$1 }
    $0 == "$$$$" { print f["<Name>"], f["<SCORE.INTER>"] }
' file
ZINC000169748276 -41.8551
The above assumes that lines containing $$$$ separate the input records.
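Because the real file holds thousands of blocks, it may be worth clearing the array at each $$$$ line so that a tag missing from one record cannot inherit the value stored by the previous record. The following is only a sketch under that assumption; the function name sdf_to_tsv and the sample data (including a second record that deliberately lacks > <SCORE.INTER>) are hypothetical.

```shell
# Sketch: one TSV row per record, with the tag->value array reset at every "$$$$".
sdf_to_tsv() {
    awk -v OFS='\t' '
        /^>/ { tag = $2; next }              # "> <Name>" -> tag becomes "<Name>"
        $0 == "$$$$" {                       # end of the current record
            print f["<Name>"], f["<SCORE.INTER>"]
            split("", f); tag = ""           # portable way to empty the array
            next
        }
        NF && tag != "" { f[tag] = $1 }      # store the value under the current tag
    ' "$1"
}

# Hypothetical sample: the second record has no <SCORE.INTER> entry.
sample_multi=$(mktemp)
printf '%s\n' \
    'ZINC000169748276' 'M  END' \
    '> <Name>' 'ZINC000169748276' '> <SCORE.INTER>' '-41.8551' '$$$$' \
    'ZINC000079214514' 'M  END' \
    '> <Name>' 'ZINC000079214514' '$$$$' > "$sample_multi"

sdf_to_tsv "$sample_multi"
```

For the second record the score column comes out empty instead of repeating -41.8551. Redirecting the output, for example sdf_to_tsv test.sdf > results.tsv, would give the requested file (results.tsv is just an example name).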
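Note that with this approach of first building an array (f[] above) that maps the tags/names to their values, you can print whatever values you like in whatever order you like, convert the whole thing to CSV, compare values with other values by name, and so on. You could write something like this to analyze your data and output a report:
awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    NF { f[tag]=$1 }
    $0 == "$$$$" {
        if ( (f["<SCORE.INTRA.POLAR>"] >= f["<SCORE.INTRA.REPUL>"]) &&
             (f["<SCORE.RESTR.CAVITY>"] == 27) ) {
            print f["<Name>"]
            for ( tag in f ) {
                if ( tag ~ /SCORE/ ) {
                    print f[tag]
                }
            }
        }
    }
' file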
If you are ever considering using getline, see http://awk.freeshell.org/AllAboutGetline for why it is usually the wrong approach.
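For comparison, the getline one-liner can be written without getline by remembering which tag the previous line carried and acting on the next line when it arrives. This is only a sketch and relies on the same assumption as the getline version (each > <Name> comes before its > <SCORE.INTER>); the function name name_score, the variable want, and the sample file are made up for illustration.

```shell
# Sketch: same output as the getline one-liner, using a state variable instead.
name_score() {
    awk '
        want == "name"  { printf "%s\t", $0; want = ""; next }  # line after "> <Name>"
        want == "score" { print; want = ""; next }              # line after "> <SCORE.INTER>"
        /^> *<Name>/         { want = "name" }
        /^> *<SCORE\.INTER>/ { want = "score" }                 # "." escaped; the ">" excludes .CONST etc.
    ' "$1"
}

# One-record sample taken from the values in the question (molblock omitted).
sample_flag=$(mktemp)
printf '%s\n' \
    '> <Name>' 'ZINC000169748276' \
    '> <SCORE>' '-41.7582' \
    '> <SCORE.INTER>' '-41.8551' \
    '> <SCORE.INTER.CONST>' '1' '$$$$' > "$sample_flag"

name_score "$sample_flag"
```

Since awk reads every line through its normal main loop here, there is no getline return value to check at end of input.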