sed: removing duplicated patterns in the log file

I am post-processing a log file arranged in the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure31R_nsp5holo_rep1.pdb

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 ND2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/A UNL 1 N   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 2HD2   3.419  2.541
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/A UNL 1 O   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 1HE2   2.883  2.159
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/? HIS 163 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/A UNL 1 O   no hydrogen  

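From this log, I need to keep all lines from the 3rd line onward and then remove every occurrence of the repeated pattern "SarsCov2_structure31R_nsp5holo_rep1.pdb". Can I use a regular expression with sed so that any phrase in the log matching this pattern (ending in .pdb) is removed automatically for each processed log? The expected output should be: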

Models used:
    1.1 
    1.6 
    1.10 
    1.8 
    1.2 
    1.3 
    1.4 
    1.7 
    1.5 
    1.9 

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.3/? ASN 142 ND2    #1.3/A UNL 1 N    #1.3/? ASN 142 2HD2   3.419  2.541
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O    #1.5/? GLN 189 1HE2   2.883  2.159
 #1.6/? HIS 163 NE2    #1.6/A UNL 1 O   no hydrogen            3.299  N/A
 #1.7/? GLN 189 NE2    #1.7/A UNL 1 O    #1.7/? GLN 189 1HE2   3.109  2.147
 #1.9/? ASN 142 ND2    #1.9/A UNL 1 O    #1.9/? ASN 142 1HD2   3.032  2.319
 #1.10/? GLN 189 NE2   #1.10/A UNL 1 O   #1.10/? GLN 189 1HE2  3.054  2.125

Here is an attempt without a regular expression, which does not work yet :-)

cat test.log | tail -n +2 | sed -e "/SarsCov2_structure31R_nsp5holo_rep1.pdb/d" >> ./test2.log

You can use this sed:

sed -E '1,2d; s/[[:blank:]]*[^[:blank:]]+\.pdb//g' file

Models used:
    1.1
    1.6
    1.10
    1.8
    1.2
    1.3
    1.4
    1.7
    1.5
    1.9

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.3/? ASN 142 ND2 #1.3/A UNL 1 N #1.3/? ASN 142 2HD2   3.419  2.541
 #1.5/? GLN 189 NE2 #1.5/A UNL 1 O #1.5/? GLN 189 1HE2   2.883  2.159
 #1.6/? HIS 163 NE2 #1.6/A UNL 1 O   no hydrogen

Details:

  • 1,2d: delete the first two lines
  • s/[[:blank:]]*[^[:blank:]]+\.pdb//g: on every line, globally remove 0 or more blanks (spaces or tabs) followed by 1+ non-blank characters ending in .pdb
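
To see the two commands at work, here is a minimal sketch that runs the same sed expression on a shortened four-line sample. The file names sample.log and sample.out are hypothetical, and -E assumes a sed with extended regular expression support (GNU and BSD sed both accept it).

```shell
# Build a small sample log (sample.log is a hypothetical file name).
cat > sample.log <<'EOF'
Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
EOF

# 1,2d drops the two header lines; s///g strips every "<blanks><name>.pdb".
sed -E '1,2d; s/[[:blank:]]*[^[:blank:]]+\.pdb//g' sample.log > sample.out
cat sample.out
```

Because the blanks in front of each file name are part of the match, the "Models used" entries end up without trailing spaces, but the column spacing in the H-bond table shrinks as well.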

Using sed

$ sed 's/[[:alnum:]_]*\.pdb//g;1,2d' input_file
Models used:
    1.1
    1.6
    1.10
    1.8
    1.2
    1.3
    1.4
    1.7
    1.5
    1.9

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.3/? ASN 142 ND2    #1.3/A UNL 1 N    #1.3/? ASN 142 2HD2   3.419  2.541
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O    #1.5/? GLN 189 1HE2   2.883  2.159
 #1.6/? HIS 163 NE2    #1.6/A UNL 1 O   no hydrogen
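
This variant keeps the column spacing because it does not consume the blanks in front of the file name, but for the same reason it leaves a trailing space after each "Models used" entry. Note also that [[:alnum:]_] covers only letters, digits, and underscores, so a file name containing a hyphen or an extra dot would be only partly removed. As a sketch, assuming trailing blanks are unwanted, a second substitution can strip them; the file names sample2.log and sample2.out are hypothetical.

```shell
# Shortened sample: two header lines, one model entry, one H-bond row.
cat > sample2.log <<'EOF'
Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
    1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/? HIS 163 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/A UNL 1 O   no hydrogen
EOF

# First s///g removes the .pdb names, second s/// trims trailing blanks,
# and 1,2d discards the two header lines.
sed 's/[[:alnum:]_]*\.pdb//g; s/[[:blank:]]*$//; 1,2d' sample2.log > sample2.out
cat sample2.out
```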

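With your shown samples, please try the following awk code. Simple explanation: the condition FNR>2 is checked first, and only when it is true are the statements inside its block run. Inside the block, gsub globally replaces every match of [[:space:]]*SarsCov2_structure31R_nsp5holo_rep1\.pdb with NULL (an empty string), as per the shown samples, and then the current line is printed.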

awk '
FNR>2{
  gsub(/[[:space:]]*SarsCov2_structure31R_nsp5holo_rep1\.pdb/,"")
  print
}
' Input_file
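
The question also asks for the removal to happen automatically for each processed log, with any name ending in .pdb. The sketch below shows one way this could look: the awk pattern is generalized so the file name is not hard-coded, and a loop writes one cleaned file per log. The log names (rep1.log, rep2.log), the .clean.log suffix, and the structure_*.pdb names are all hypothetical. [ \t] is used in place of [[:space:]] only as a portability assumption, since some older awk builds lack POSIX character classes.

```shell
# Create two tiny example logs (all names here are hypothetical).
for name in rep1 rep2; do
  printf '%s\n' \
    'Finding intramodel H-bonds' \
    'Constraints relaxed by 0.55 angstroms and 20 degrees' \
    'Models used:' \
    "    1.1 structure_${name}.pdb" > "${name}.log"
done

# For each log: skip the first two lines, strip any "<blanks><name>.pdb",
# and write the result next to the input as <name>.clean.log.
for log in rep1.log rep2.log; do
  awk 'FNR > 2 { gsub(/[ \t]*[^ \t]+\.pdb/, ""); print }' "$log" > "${log%.log}.clean.log"
done

cat rep1.clean.log
```

Writing to a separate output file, rather than appending with >> as in the original attempt, avoids accumulating results from earlier runs.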