awk: processing a log and searching for a pattern

I am working with log files arranged in the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.5 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.11 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.12 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.13 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.14 SarsCov2_structure49R_nsp5holo_rep1.pdb

14 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/? ASN 142 ND2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/? ASN 142 1HD2   3.102  2.145
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/? GLU 166 H      3.011  2.024
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/? GLU 166 H      3.037  2.132
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? HIS 163 NE2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/A UNL 888 O   no hydrogen                                                   3.388  N/A
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? GLU 166 H      2.806  1.792
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? THR 26 N      SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? THR 26 H       3.093  2.142
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? GLY 143 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? GLY 143 H      3.030  2.193
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/? GLN 189 NE2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/? GLN 189 2HE2   3.052  2.301
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/? GLU 166 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/? GLU 166 H     2.854  1.868
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/? GLY 143 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/? GLY 143 H     3.103  2.070
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? GLY 143 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? GLY 143 H     3.161  2.224
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? CYS 145 SG   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? CYS 145 HG    3.421  2.842
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? ASN 142 ND2  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? ASN 142 2HD2  3.055  2.465
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? CYS 145 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? CYS 145 H     2.924  2.143

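I need to find the first occurrence of the pattern "GLU 166 N" and print the number associated with it, taken from the #1.number/? token that precedes the pattern on the same line. In this example the detected number should be 3 (because the associated token is #1.3/?).

I would start with basic pattern detection:

awk '/GLU 166 N/' file

But how do I correctly locate the number defined before the pattern and print it as output? Finally, if the pattern is not found, I would like the script to print 1.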

$ awk -v n=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' file
3
$ awk -v n=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' /dev/null
1

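What you are looking for is in the second field ($2). gsub(/.*\.|\/\?/,"",$2) replaces everything in $2 up to and including the period, as well as the trailing /?, with an empty string, so only the number remains. n starts at 1 (set with -v), so 1 is printed when no line matches.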
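To see what that gsub call does to the second field in isolation, here is a minimal sketch. The helper name strip_field and the shortened sample line are made up for illustration and are not part of the original log:

```shell
# strip_field: apply the same gsub to field 2 of a single line and print it.
# (Hypothetical helper, only meant to show the substitution step by itself.)
strip_field() {
    printf '%s\n' "$1" |
    awk '{ gsub(/.*\.|\/\?/, "", $2); print $2 }'
}

# Field 2 is "#1.10/?": "#1." and "/?" are removed, leaving the number.
strip_field 'model.pdb #1.10/? GLU 166 N'    # prints 10
```

Because the substitution is restricted to $2, the period in the file name in the first field is left alone.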

If you have GNU awk, which supports the gensub function, please try:

awk '/GLU 166 N/ {
    print gensub(/^.*#1\.([0-9]+)\/\? GLU 166 N.*$/, "\\1", 1)
    exit
}'  file

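The regex ^.*#1\.([0-9]+)/\? GLU 166 N.*$ matches a line containing the substring #1.<number>/? GLU 166 N. The <number> part, enclosed in parentheses in the regex as ([0-9]+), is captured as group 1. The whole line is then replaced by group 1, which is specified by the backreference \1 in the replacement, and the result is printed.
Or you can use GNU sed: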

sed -nE '0,/GLU 166 N/s|^.*#1\.([0-9]+)/\? GLU 166 N.*|\1|p' file

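The address 0,/pattern/, where 0 as a starting line is specific to GNU sed, limits the substitution to the lines up to and including the first pattern match, so only the first occurrence is printed.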

Using GNU awk for the third argument to match():

$ awk 'match($0,/([0-9]+).. GLU 166 N /,a){print a[1]; exit}' file
3

Or using any awk:

$ awk 'match($0,/[0-9]+.. GLU 166 N /){sub("/.*",""); print substr($0,RSTART); exit}' file
3

$ awk 'match($0,/[0-9]+.. GLU 166 N /){print substr($0,RSTART,RLENGTH-13); exit}' file
3
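The question also asks for 1 when the pattern is absent, which the match() one-liners above do not handle. A possible sketch that combines the portable RSTART/RLENGTH version with a default set via -v; the function name first_glu166 and the variable n are illustrative choices, not from the original answers:

```shell
# first_glu166 FILE: print the model number of the first "GLU 166 N" donor,
# or 1 if no line matches (hypothetical wrapper around the portable one-liner).
first_glu166() {
    awk -v n=1 '
        match($0, /[0-9]+.. GLU 166 N /) {
            # The match is the digits plus 13 more characters ("/? GLU 166 N "),
            # so RLENGTH - 13 is the length of the number itself.
            n = substr($0, RSTART, RLENGTH - 13)
            exit            # stop reading; the END rule still runs
        }
        END { print n }
    ' "$1"
}
```

Called as first_glu166 file, this should give 3 for the log in the question, and first_glu166 /dev/null gives 1.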

If awk is not a requirement, you can use grep and cut. Simple is good.

(input.txt contains the same log as shown in the question.)

grep -om1 '[[:digit:]]*/? GLU 166 N' input.txt | cut -d/ -f1
3

To print 1 when the pattern is not found:

{ grep -om1 '[[:digit:]]*/? GLU 166 N' input.txt || echo 1; } | cut -d/ -f1