Extract data from an SDF file in a bash environment

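I want to extract data from an SDF file.

I want to save the > <Name> and > <SCORE.INTER> values in a .tsv file. Is there a quick way to do this, e.g. with awk? Thanks in advance.

The SDF file consists of thousands of blocks. One block of the file looks like this: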

ZINC000169748276

 38 39  0  0  0  0  0  0  0  0999 V2000
   11.2318    3.6419   22.3134 C   0  0  0  0  0  0
   12.5621    3.7685   22.2617 C   0  0  0  0  0  0
   13.0725    5.1806   22.3121 C   0  0  0  0  0  0
   10.8850    6.0303   22.4462 C   0  0  0  0  0  0
   13.4310    2.6268   22.1614 C   0  0  0  0  0  0
   12.9848    1.3691   22.0592 C   0  0  0  0  0  0
    8.2548    4.7608   21.1375 C   0  0  0  0  0  0
    7.1479    3.7322   21.1132 C   0  0  0  0  0  0
    7.7728    2.5366   21.8185 C   0  0  0  0  0  0
    8.9539    4.4605   22.4534 C   0  0  0  0  0  0
   13.8873    0.1824   21.9500 C   0  0  0  0  0  0
    8.5117    1.6060   20.8656 C   0  0  0  0  0  0
   12.2544    6.2009   22.3970 N   0  0  0  0  0  0
   10.3635    4.7178   22.4055 N   0  0  0  0  0  0
   14.4254    5.4429   22.2718 N   0  0  0  0  0  0
   13.7646   -0.5167   20.6443 N   0  3  0  0  0  0
    6.5529   -4.6019   19.9460 O   0  5  0  0  0  0
    8.2203   -4.0310   21.8048 O   0  5  0  0  0  0
    6.8149    1.6459   17.3793 O   0  5  0  0  0  0
    5.4231   -2.1179   18.5726 O   0  5  0  0  0  0
   10.1403    7.0090   22.5243 O   0  0  0  0  0  0
    5.7155   -3.6365   22.1679 O   0  0  0  0  0  0
    5.6431    1.8811   19.7228 O   0  0  0  0  0  0
    5.0295   -0.6218   20.7059 O   0  0  0  0  0  0
    8.7342    3.0736   22.7475 O   0  0  0  0  0  0
    6.0324    4.2091   21.8626 O   0  0  0  0  0  0
    8.1857    1.9631   19.5323 O   0  0  0  0  0  0
    7.0232   -2.2197   20.5667 O   0  0  0  0  0  0
    7.0081   -0.1966   19.1450 O   0  0  0  0  0  0
    6.8632   -3.7464   21.1697 P   0  0  0  0  0  0
    6.7991    1.4009   18.8725 P   0  0  0  0  0  0
    5.9605   -1.3044   19.7288 P   0  0  0  0  0  0
   15.0444    4.6730   22.2089 H   0  0  0  0  0  0
   14.7148    6.3890   22.3078 H   0  0  0  0  0  0
   14.3405   -1.3642   20.6292 H   0  0  0  0  0  0
   14.0706    0.0896   19.8769 H   0  0  0  0  0  0
   12.7928   -0.7891   20.4667 H   0  0  0  0  0  0
    5.3352    3.5319   21.8055 H   0  0  0  0  0  0
  1  2  2  0  0  0
  1 14  1  0  0  0
  2  3  1  0  0  0
  2  5  1  0  0  0
  3 13  2  0  0  0
  3 15  1  0  0  0
  4 13  1  0  0  0
  4 14  1  0  0  0
  4 21  2  0  0  0
  5  6  2  0  0  0
  6 11  1  0  0  0
  7  8  1  0  0  0
  7 10  1  0  0  0
  8  9  1  0  0  0
  8 26  1  0  0  0
  9 12  1  0  0  0
  9 25  1  0  0  0
 10 14  1  0  0  0
 10 25  1  0  0  0
 11 16  1  0  0  0
 12 27  1  0  0  0
 17 30  1  0  0  0
 18 30  1  0  0  0
 19 31  1  0  0  0
 20 32  1  0  0  0
 22 30  2  0  0  0
 23 31  2  0  0  0
 24 32  2  0  0  0
 27 31  1  0  0  0
 28 30  1  0  0  0
 28 32  1  0  0  0
 29 31  1  0  0  0
 29 32  1  0  0  0
 15 33  1  0  0  0
 15 34  1  0  0  0
 16 35  1  0  0  0
 16 36  1  0  0  0
 16 37  1  0  0  0
 26 38  1  0  0  0
M  END
>  <CHROM.1>
2.74804207,-114.83879868,178.63419806,-11.86097681,-104.18799792,-175.61867989
-82.60305529,-167.43897154,58.52671946,-50.63759561,-111.24083331,101.74294800
8.69431853,1.29062552,20.98254072,-0.89039136,0.27787279,-3.08051579

>  <Name>
ZINC000169748276

>  <RI>
1.76083e+07


>  <Rbt.Executable>
rbdock/0.1.0

>  <Rbt.Library>
librxdock.so/0.1.0

>  <SCORE>
-41.7582

>  <SCORE.INTER>
-41.8551

>  <SCORE.INTER.CONST>
1

>  <SCORE.INTER.POLAR>
-4.96496

>  <SCORE.INTER.REPUL>
0

>  <SCORE.INTER.ROT>
10

>  <SCORE.INTER.VDW>
-40.3742

>  <SCORE.INTER.norm>
-1.30797

>  <SCORE.INTRA>
0.0969082

>  <SCORE.INTRA.DIHEDRAL>
-5.79141

>  <SCORE.INTRA.DIHEDRAL.0>
19.5819

>  <SCORE.INTRA.POLAR>
0

>  <SCORE.INTRA.POLAR.0>
0

>  <SCORE.INTRA.REPUL>
0

>  <SCORE.INTRA.REPUL.0>
0

>  <SCORE.INTRA.VDW>
2.99261

>  <SCORE.INTRA.VDW.0>
-5.2787

>  <SCORE.INTRA.norm>
0.00302838

>  <SCORE.RESTR>
0

>  <SCORE.RESTR.CAVITY>
0

>  <SCORE.RESTR.norm>
0

>  <SCORE.SYSTEM>
0

>  <SCORE.SYSTEM.CONST>
0

>  <SCORE.SYSTEM.DIHEDRAL>
0

>  <SCORE.SYSTEM.norm>
0

>  <SCORE.heavy>
32

>  <SCORE.norm>
-1.30494

$$$$

The .tsv file should look like this:

ZINC000169748276    -41.8551
ZINC000079214514    -41.7892
ZINC000195993528    -40.9293

Why awk?

Prompt> grep -A 1 -i "<NAME>" test.txt | tail -n 1
ZINC000169748276
Prompt> grep -A 1 -i "<SCORE.INTER>" test.txt | tail -n 1
-41.8551

As you can see, grep is much easier.

-A 1 means "also print the 1 line that follows each match".

After some discussion, this is the final solution:

grep -A 1 -i "<SCORE.INTER>" test.sdf | grep -v '^>' | grep -v '^--' >> results
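The pipeline above collects only the scores. As a rough sketch of how the same grep idea could produce the two-column .tsv from the question, the value lines for each tag can be extracted separately and joined with paste. The function names sdf_field and sdf_to_tsv are hypothetical, and the sketch assumes every block contains both tags exactly once, in the same order, otherwise the columns drift out of step.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the line that follows each ">  <TAG>" header.
# -F matches the tag literally, so "<SCORE.INTER>" does not also match
# "<SCORE.INTER.CONST>" and the dots need no escaping.
sdf_field() {
    # $1 = tag name without angle brackets, $2 = SDF file
    grep -A 1 -F -- "<$1>" "$2" | grep -v -e '^>' -e '^--$'
}

# Hypothetical helper: write "Name<TAB>SCORE.INTER" rows to stdout.
# Assumes both tags occur once per block (an assumption, not checked here).
sdf_to_tsv() {
    sdf_names=$(mktemp)
    sdf_scores=$(mktemp)
    sdf_field Name "$1" > "$sdf_names"
    sdf_field SCORE.INTER "$1" > "$sdf_scores"
    paste "$sdf_names" "$sdf_scores"
    rm -f "$sdf_names" "$sdf_scores"
}

# Usage (test.sdf is the file name used above):
#   sdf_to_tsv test.sdf > results.tsv
```

The second grep drops the header lines and the "--" separators that grep prints between groups of context lines; a negative score such as -41.8551 starts with a single dash and is kept.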

I want to save the > <NAME> and > <SCORE.INTER> values in a .tsv file. Is there any way for a quick solution e.g. via awk?

Your file has > <Name>, not > <NAME> (an important difference if you match case-sensitively). I would use GNU AWK for this task in the following way, assuming that > <Name> always comes before > <SCORE.INTER> and that every > <SCORE.INTER> has a corresponding > <Name>. Let file.txt contain the single block shown in the question, then

awk '/^>  <Name>/{getline;printf "%s\t",$0}/^>  <SCORE\.INTER>/{getline;print $0}' file.txt

Output

ZINC000169748276    -41.8551

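Explanation: getline makes GNU AWK load the next line, so $0 becomes the content of the line after the current one. When a line starting (^) with >  <Name> is encountered, the next line is loaded and printed, followed by a TAB. When a line starting with >  <SCORE.INTER> is encountered, the next line is loaded and printed, followed by a newline. Note that . has a special meaning in regular expressions and needs to be escaped.

(Tested in gawk 4.2.1)

Using any awk: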

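$ awk -v OFS='\t' '
    /^>/ { tag=$2; next }      # header line such as ">  <Name>": remember the tag
    NF { f[tag]=$1 }           # non-empty line: store it under the current tag
    $0 == "$$$$" { print f["<Name>"], f["<SCORE.INTER>"] }   # end of record
' file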
ZINC000169748276        -41.8551

The above assumes that lines containing $$$$ separate the input records.
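One consequence of relying on $$$$ as the record separator is worth spelling out: the f[] array is never emptied, so a block that lacks a tag would silently reuse the value from the previous block. The following is a minimal sketch, assuming you would rather get an empty field than a stale value. The function name sdf_name_score is hypothetical, and split("", f) is used as a portable way to clear an array.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "Name<TAB>SCORE.INTER" per record,
# resetting the tag-to-value map at every $$$$ line.
sdf_name_score() {
    awk -v OFS='\t' '
        /^>/         { tag = $2; next }        # header line: remember the tag
        $0 == "$$$$" {                         # end of record: emit, then reset
            print f["<Name>"], f["<SCORE.INTER>"]
            split("", f)                       # empty the array for the next record
            tag = ""
            next
        }
        NF && tag != "" { f[tag] = $1 }        # value line under the current tag
    ' "$1"
}

# Usage:
#   sdf_name_score file > results.tsv
```

Handling the $$$$ line before the value rule also keeps the separator itself from being stored as the value of the last tag in the block.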

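Note that with this approach of first creating an array (f[] above) that maps tags/names to their values, you can print whichever values you like in whatever order you like, convert the whole thing to CSV, compare values with other values by name, etc. You could write something like this to analyze your data and output reports: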

awk -v OFS='\t' '
    /^>/ { tag=$2; next }
    NF { f[tag]=$1 }
    $0 == "$$$$" {
        if (    (f["<SCORE.INTRA.POLAR>"] >= f["<SCORE.INTRA.REPUL>"]) &&
                (f["<SCORE.RESTR.CAVITY>"] == 27) ) {
            print f["<Name>"]
            for ( tag in f ) {
                if ( tag ~ /SCORE/ ) {
                    print f[tag]
                }
            }
        }
    }
' file
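The same tag-to-value array also makes the CSV conversion mentioned above fairly short. This is only a sketch under some assumptions: the function name sdf_to_csv and its comma-separated tag list argument are hypothetical, each value is taken to be a single token, and no quoting is done, so it would not be suitable for values that themselves contain commas.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print selected tags of every record as CSV.
sdf_to_csv() {
    # $1 = comma-separated tag names without angle brackets, $2 = SDF file
    awk -v tags="$1" '
        BEGIN {
            n = split(tags, want, ",")         # want[1..n] = requested tags
            print tags                         # header row
        }
        /^>/         { tag = $2; next }        # header line: remember the tag
        $0 == "$$$$" {                         # end of record: build one CSV row
            row = ""
            for (i = 1; i <= n; i++)
                row = row (i > 1 ? "," : "") f["<" want[i] ">"]
            print row
            split("", f)                       # reset for the next record
            tag = ""
            next
        }
        NF && tag != "" { f[tag] = $1 }        # value line under the current tag
    ' "$2"
}

# Usage:
#   sdf_to_csv "Name,SCORE,SCORE.INTER" file > results.csv
```

Because the columns are looked up by name, their order in the output follows the argument rather than the order of the tags in the file.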

If you are ever considering using getline, see http://awk.freeshell.org/AllAboutGetline for why it is usually the wrong approach.