sed: pattern search in segmented data

Question

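I am working with data that is divided into several blocks by --- separator lines, where each block's ID is given at the start of the block's first line: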

# an example with 4 blocks: 06I, 5p9, Y6J, jacks18

06I: 18 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? THR 26 N       #1.1/A UNL 1 O      #1.1/? THR 26 H      3.515  2.716
 #1.1/? ASN 142 ND2    #1.1/A UNL 1 O      #1.1/? ASN 142 2HD2  3.227  2.305
 #1.1/A UNL 1 N        #1.1/? THR 26 O     #1.1/A UNL 1 H       3.463  2.652
 #1.2/A UNL 1 N        #1.2/? PHE 140 O    #1.2/A UNL 1 H       2.987  2.200
 #1.4/? THR 26 N       #1.4/A UNL 1 S      #1.4/? THR 26 H      4.354  3.371
 #1.4/? HIS 163 NE2    #1.4/A UNL 1 N     no hydrogen            3.137  N/A
 #1.4/A UNL 1 N        #1.4/? ARG 188 O    #1.4/A UNL 1 H       3.000  2.081
 #1.5/? HIS 163 NE2    #1.5/A UNL 1 N     no hydrogen            3.330  N/A
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O      #1.5/? GLN 189 2HE2  3.029  2.132
 #1.6/A UNL 1 N        #1.6/? ARG 188 O    #1.6/A UNL 1 H       2.984  2.064
 #1.8/? ASN 142 ND2    #1.8/A UNL 1 N      #1.8/? ASN 142 2HD2  3.164  2.395
 #1.8/? ASN 142 ND2    #1.8/A UNL 1 O      #1.8/? ASN 142 2HD2  3.031  2.180
 #1.8/? GLN 189 NE2    #1.8/A UNL 1 O      #1.8/? GLN 189 1HE2  3.276  2.553
 #1.8/A UNL 1 N        #1.8/? THR 190 O    #1.8/A UNL 1 H       3.257  2.407
 #1.9/A UNL 1 N        #1.9/? THR 190 O    #1.9/A UNL 1 H       2.913  2.037
 #1.10/? SER 144 OG    #1.10/A UNL 1 S     #1.10/? SER 144 HG   4.246  3.845
 #1.10/? HIS 163 NE2   #1.10/A UNL 1 S    no hydrogen            3.700  N/A
 #1.10/A UNL 1 N       #1.10/? THR 190 O   #1.10/A UNL 1 H      3.008  2.091
-----------------------------------------------------------------------------
5p9: 12 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? GLY 143 N      #1.1/A 5P9 1 O2    #1.1/? GLY 143 H      2.939  2.013
 #1.1/? CYS 145 SG     #1.1/A 5P9 1 N2    #1.1/? CYS 145 HG     3.678  2.679
 #1.1/? CYS 145 SG     #1.1/A 5P9 1 O2    #1.1/? CYS 145 HG     3.584  2.963
 #1.1/? HIS 163 NE2    #1.1/A 5P9 1 O1   no hydrogen            3.307  N/A
 #1.2/? ASN 142 ND2    #1.2/A 5P9 1 N2    #1.2/? ASN 142 2HD2   3.413  2.583
 #1.4/? ASN 142 ND2    #1.4/A 5P9 1 O2    #1.4/? ASN 142 2HD2   3.032  2.290
 #1.5/? GLN 189 NE2    #1.5/A 5P9 1 O1    #1.5/? GLN 189 1HE2   3.546  2.574
 #1.9/? GLY 143 N      #1.9/A 5P9 1 N2    #1.9/? GLY 143 H      3.241  2.345
 #1.9/? GLY 143 N      #1.9/A 5P9 1 O2    #1.9/? GLY 143 H      3.158  2.273
 #1.9/? GLN 189 NE2    #1.9/A 5P9 1 O1    #1.9/? GLN 189 1HE2   3.265  2.561
 #1.10/? ASN 142 ND2   #1.10/A 5P9 1 O2   #1.10/? ASN 142 2HD2  3.080  2.518
 #1.11/? ASN 142 ND2   #1.11/A 5P9 1 O2   #1.11/? ASN 142 1HD2  2.942  2.261
-----------------------------------------------------------------------------
Y6J: 19 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? SER 144 OG    #1.1/A UNL 1 S       #1.1/? SER 144 HG    4.242  3.841
 #1.1/? HIS 163 NE2   #1.1/A UNL 1 S      no hydrogen            3.869  N/A
 #1.1/? GLN 189 NE2   #1.1/A UNL 1 O       #1.1/? GLN 189 1HE2  3.192  2.191
 #1.1/? GLN 189 NE2   #1.1/A UNL 1 O       #1.1/? GLN 189 2HE2  3.129  2.463
 #1.2/? GLN 189 NE2   #1.2/A UNL 1 O       #1.2/? GLN 189 1HE2  3.244  2.245
 #1.2/? GLN 189 NE2   #1.2/A UNL 1 O       #1.2/? GLN 189 2HE2  3.145  2.414
 #1.3/? GLN 189 NE2   #1.3/A UNL 1 O       #1.3/? GLN 189 1HE2  2.980  2.036
 #1.4/? GLY 143 N     #1.4/A UNL 1 S       #1.4/? GLY 143 H     3.989  3.296
 #1.4/? SER 144 N     #1.4/A UNL 1 S       #1.4/? SER 144 H     3.910  3.194
 #1.4/? GLN 189 NE2   #1.4/A UNL 1 O       #1.4/? GLN 189 1HE2  3.153  2.331
 #1.5/? HIS 163 NE2   #1.5/A UNL 1 S      no hydrogen            3.901  N/A
 #1.5/? GLN 189 NE2   #1.5/A UNL 1 O       #1.5/? GLN 189 1HE2  3.161  2.580
 #1.5/A UNL 1 N       #1.5/? GLU 166 OE2   #1.5/A UNL 1 H       3.147  2.198
 #1.6/? GLY 143 N     #1.6/A UNL 1 N       #1.6/? GLY 143 H     3.145  2.243
 #1.6/? GLN 189 NE2   #1.6/A UNL 1 O       #1.6/? GLN 189 1HE2  2.985  2.119
 #1.6/A UNL 1 N       #1.6/? GLU 166 OE1   #1.6/A UNL 1 H       2.974  2.005
 #1.7/? GLY 143 N     #1.7/A UNL 1 S       #1.7/? GLY 143 H     3.841  2.976
 #1.8/A UNL 1 N       #1.8/? PHE 140 O     #1.8/A UNL 1 H       2.937  2.062
 #1.10/? GLY 143 N    #1.10/A UNL 1 O      #1.10/? GLY 143 H    3.182  2.150
-----------------------------------------------------------------------------
jacks18: 11 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.1/? HIS 163 NE2    #1.1/A V1B 1 N3      no hydrogen            3.189  N/A
 #1.2/? ASN 142 ND2    #1.2/A V1B 1 O2       #1.2/? ASN 142 1HD2   3.089  2.515
 #1.4/? ASN 142 ND2    #1.4/A V1B 1 O2       #1.4/? ASN 142 2HD2   3.258  2.631
 #1.4/? GLY 143 N      #1.4/A V1B 1 N3       #1.4/? GLY 143 H      3.143  2.116
 #1.5/? GLN 189 NE2    #1.5/A V1B 1 O2       #1.5/? GLN 189 1HE2   3.087  2.354
 #1.6/? ASN 142 ND2    #1.6/A V1B 1 O2       #1.6/? ASN 142 2HD2   3.093  2.110
 #1.7/? GLN 189 NE2    #1.7/A V1B 1 O2       #1.7/? GLN 189 1HE2   3.031  2.322
 #1.7/A V1B 1 N1       #1.7/? GLU 166 OE1    #1.7/A V1B 1 H        2.983  2.094
 #1.9/? ASN 142 ND2    #1.9/A V1B 1 O1       #1.9/? ASN 142 2HD2   3.071  2.214
 #1.10/? ASN 142 ND2   #1.10/A V1B 1 O2      #1.10/? ASN 142 2HD2  3.108  2.171
 #1.11/A V1B 1 N1      #1.11/? GLU 166 OE2   #1.11/A V1B 1 H       3.355  2.549
-----------------------------------------------------------------------------

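In each block, I need to find a specified pattern that corresponds to the number (e.g. 26, 142, 140) following the three-letter code (e.g. ASN). Basically, I need information about the first occurrence of the pattern in each block. If the specified number is detected on a line, the expected output should include the value from the penultimate column of that line. For example, given my input of 4 blocks, for "142" I should obtain: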

06I: the first occurrence of 142 is (3.227). #142 found 3 times
5p9: the first occurrence of 142 is (3.413). #142 found 4 times
Y6J: (). 142 found 0 times.
jacks18: the first occurrence of 142 is (3.089). #142 found 5 times

I can use sed to print every line that contains the pattern:

pattern='142'
sed -n "/${pattern}/p" input.log

Or I can use sed to print only the first line that contains the pattern and then quit:

sed -n "/${pattern}/p; /${pattern}/q" input.log

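Could you suggest a way to adapt these commands to the multi-block structure of the file, so that they print each block's name and the occurrences of the pattern in the format shown above?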

Answer 1

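Assumptions:

the block 'name' always appears at the start of a line, contains no white space, and ends with a :
no other line starts with something that could be mistaken for a block 'name' (i.e., no other line starts with ^<alphanumeric>:)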
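
These assumptions can be checked against a file before going further. The following is a minimal sketch, not part of the original answer: list_blocks is a hypothetical helper name, and it simply prints every first field that looks like a block name. If it prints anything other than the expected block IDs, the assumptions do not hold for that file.

```shell
# list_blocks FILE
# Hypothetical helper (the name is an assumption, used for illustration only).
# Prints each first field that consists solely of alphanumerics followed by
# one trailing colon, i.e. everything that would be treated as a block name.
list_blocks() {
    awk '$1 ~ /^[[:alnum:]]+:$/ { print $1 }' "$1"
}
```

For the sample data in the question, list_blocks input.log should print only 06I:, 5p9:, Y6J: and jacks18:.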

One awk idea:

awk -v ptn="ASN 142" '                                           # define pattern to search for

function print_findings() {

    if (block)                                                   # if block is non-empty
       if (count)                                                # if count is non-zero
          printf "%s the first occurrence of %s is (%s). #%s found %d times\n", block, ptn, value, ptn, count
       else                                                      # else count=0
          printf "%s (). %s found 0 times.\n", block, ptn
}

$1 ~ /^[[:alnum:]]+:$/ { print_findings()                        # flush previous block details
                         block=$1                                # grab new block name
                         count=0                                 # reset
                         value=""                                # reset
                         next
                       }

$0 ~ ptn               { count++                                 # if the line matches ptn then increment counter and ...
                         value= (value == "") ? $(NF-1) : value  # save the value on the first matching line
                       }

END                    { print_findings() }                      # flush last block details
' input.log

NOTE: remove the comments to declutter the code

For -v ptn="ASN 142" this generates:

06I: the first occurrence of ASN 142 is (3.227). #ASN 142 found 3 times
5p9: the first occurrence of ASN 142 is (3.413). #ASN 142 found 4 times
Y6J: (). ASN 142 found 0 times.
jacks18: the first occurrence of ASN 142 is (3.089). #ASN 142 found 5 times

For -v ptn="GLN 189" this generates:

06I: the first occurrence of GLN 189 is (3.029). #GLN 189 found 2 times
5p9: the first occurrence of GLN 189 is (3.546). #GLN 189 found 2 times
Y6J: the first occurrence of GLN 189 is (3.192). #GLN 189 found 8 times
jacks18: the first occurrence of GLN 189 is (3.087). #GLN 189 found 2 times

For -v ptn="GLU 166" this generates:

06I: (). GLU 166 found 0 times.
5p9: (). GLU 166 found 0 times.
Y6J: the first occurrence of GLU 166 is (3.147). #GLU 166 found 2 times
jacks18: the first occurrence of GLU 166 is (2.983). #GLU 166 found 2 times
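
The question asks to search by the number alone, whatever three-letter code precedes it. One way to do that, sketched here under the same assumptions about block names, is to compare fields instead of matching the whole line. The function name first_hit and its argument order are hypothetical and not part of the original answer.

```shell
# first_hit FILE NUM
# Hypothetical helper (name and argument order are assumptions).
# For each block, counts the lines on which NUM appears as a field directly
# after a three-letter uppercase code, and reports the penultimate column of
# the first such line.
first_hit() {
    awk -v num="$2" '
    function report() {
        if (block == "") return                  # nothing seen yet
        if (count)
            printf "%s the first occurrence of %s is (%s). #%s found %d times\n", block, num, value, num, count
        else
            printf "%s (). %s found 0 times.\n", block, num
    }

    $1 ~ /^[[:alnum:]]+:$/ {                     # new block header
        report()                                 # flush the previous block
        block = $1; count = 0; value = ""
        next
    }

    {
        for (i = 2; i <= NF; i++)
            # string comparison, so "142" does not match e.g. "142.0"
            if (($i "") == (num "") && $(i-1) ~ /^[A-Z][A-Z][A-Z]$/) {
                if (!count++) value = $(NF-1)    # keep the first value only
                break                            # count each line once
            }
    }

    END { report() }                             # flush the last block
    ' "$1"
}
```

Called as first_hit input.log 142 on the sample data, this should report 3, 4, 0 and 5 matching lines for the four blocks. Because the ligand name (e.g. UNL) is also a three-letter code, a search for 1 would match the ligand entries as well.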
