Multiple pattern matching guided string replacements with sed

File1 is a fixed-format pdb file containing protein coordinates:

ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N  
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C  
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C  
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C  
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O  
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C  
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S  
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C  
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N  
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 37.48           C  
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 37.48           C  
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 37.48           C  
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 37.48           O  
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 37.48           C 
..............................................................................
..............................................................................
..............................................................................
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N  
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C  
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C  
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C  
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O  
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S  
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N  
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 90.37           C  
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 90.37           C  
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 90.37           O  
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N  
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C  
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C  
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C  
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O  
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O 
....................... plus many more lines ................................. 

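File2 is a list of representative lines built from fields 4, 5 and 6 (residue name, chain, residue number) of the .pdb file above. For simplicity, let's consider only these lines: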

GLU A   2
GLY A 124

The desired output is:

ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N  
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C  
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C  
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C  
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O  
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C  
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S  
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C  
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 00.00           N  
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 00.00           C  
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 00.00           C  
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 00.00           C  
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 00.00           O  
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 00.00           C 
..............................................................................
..............................................................................
..............................................................................
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N  
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C  
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C  
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C  
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O  
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S  
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 00.00           N  
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 00.00           C  
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 00.00           C  
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 00.00           O  
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N  
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C  
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C  
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C  
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O  
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O 

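That is, if a line of File1 contains a pattern that appears in File2, the value in columns 62-66 of that line is replaced with 00.00.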

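I already know how to do this with a Bash while-read loop and awk, but because these tools change the format and require re-formatting and/or specifying the output format, they are impractical in this particular case of processing hundreds of files. To avoid these problems I decided to look for a sed-based solution. If I give a single search pattern explicitly, I get a working solution, i.e. the following code works: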

digits=00.00
sed "/GLU A   2/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb
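
For what it's worth, the 61/5 split can be checked against the fixed columns with cut. This is only a minimal sketch using one of the GLU lines above as a hard-coded sample; the variable names are made up for illustration:

```shell
# One sample line from File1, copied verbatim (trailing blanks dropped).
line='ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N'

# Columns 18-26 hold the residue key that File2 lists.
residue=$(printf '%s\n' "$line" | cut -c18-26)

# Columns 62-66 are the five characters that follow the first 61,
# i.e. the part the second sed group covers.
bfactor=$(printf '%s\n' "$line" | cut -c62-66)

printf 'residue key: [%s]\nvalue:       [%s]\n' "$residue" "$bfactor"
```

If both brackets show the expected text, the character counts in the sed expression line up with the file layout.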

But the loop version below does not work (the File1 lines are left unchanged), and I have not managed to figure out why:

digits=00.00
while read pattern; do
    sed "/$pattern/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb
done < File2.txt

Sorry for the long message. Thanks in advance for your help.

@anubhava:

With my real data, this is what happens at the first replacement site:

ATOM    293  CE1 HIS A  38     -18.278  19.735  13.486  1.00 67.94           C  
ATOM    294  NE2 HIS A  38     -18.518  18.594  14.144  1.00 67.94           N  
ATOM    295  N   GLY A  39     -13.836  00.00   9.206  1.00 71.50           N  
ATOM    296  CA  GLY A  39     -12.628  00.00   8.447  1.00 71.50           C  
ATOM    297  C   GLY A  39     -11.358  00.00   9.286  1.00 71.50           C  
ATOM    298  O   GLY A  39     -11.411  18.636  10.344  1.00 00.00           O  
ATOM    299  N   PRO A  40     -10.180  17.577   8.797  1.00 71.93           N  
ATOM    300  CA  PRO A  40      -8.908  17.719   9.520  1.00 71.93           C  
ATOM    301  C   PRO A  40      -8.580  19.169   9.912  1.00 71.93           C  

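In this case the site is /GLY A  39/. As you can see, on some lines the last numeric column is left unchanged and an unwanted replacement appears in the 8th field instead. Oddly, this problem occurs only at the first replacement site; the remaining output is perfect. Thanks.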

awk is a better fit for this job:

awk 'FNR==NR {a[$1,$2,$3]; next} ($4,$5,$6) in a {$11="00.00"} 1' file2 file1 | column -t

ATOM  1    N   MET  A  1    -37.809  27.446   34.618  1.00  43.34  N
ATOM  2    CA  MET  A  1    -37.480  26.307   33.746  1.00  43.34  C
ATOM  3    C   MET  A  1    -36.495  25.493   34.556  1.00  43.34  C
ATOM  4    CB  MET  A  1    -36.919  26.801   32.394  1.00  43.34  C
ATOM  5    O   MET  A  1    -35.346  25.898   34.661  1.00  43.34  O
ATOM  6    CG  MET  A  1    -36.980  25.729   31.301  1.00  43.34  C
ATOM  7    SD  MET  A  1    -35.977  26.080   29.826  1.00  43.34  S
ATOM  8    CE  MET  A  1    -36.833  27.479   29.055  1.00  43.34  C
ATOM  9    N   GLU  A  2    -36.991  24.516   35.314  1.00  00.00  N
ATOM  10   CA  GLU  A  2    -36.090  23.617   36.039  1.00  00.00  C
ATOM  11   C   GLU  A  2    -35.250  22.852   35.010  1.00  00.00  C
ATOM  12   CB  GLU  A  2    -36.860  22.659   36.957  1.00  00.00  C
ATOM  13   O   GLU  A  2    -35.776  22.534   33.938  1.00  00.00  O
ATOM  14   CG  GLU  A  2    -37.467  23.407   38.153  1.00  00.00  C
ATOM  981  N   CYS  A  123  -15.659  -7.164   13.998  1.00  90.53  N
ATOM  982  CA  CYS  A  123  -16.801  -7.332   13.106  1.00  90.53  C
ATOM  983  C   CYS  A  123  -17.894  -8.234   13.699  1.00  90.53  C
ATOM  984  CB  CYS  A  123  -16.321  -7.886   11.757  1.00  90.53  C
ATOM  985  O   CYS  A  123  -18.918  -8.425   13.046  1.00  90.53  O
ATOM  986  SG  CYS  A  123  -15.266  -6.683   10.904  1.00  90.53  S
ATOM  987  N   GLY  A  124  -17.679  -8.840   14.874  1.00  00.00  N
ATOM  988  CA  GLY  A  124  -18.641  -9.764   15.474  1.00  00.00  C
ATOM  989  C   GLY  A  124  -18.851  -11.029  14.637  1.00  00.00  C
ATOM  990  O   GLY  A  124  -19.970  -11.514  14.513  1.00  00.00  O
ATOM  991  N   SER  A  125  -17.793  -11.536  13.996  1.00  92.09  N
ATOM  992  CA  SER  A  125  -17.837  -12.749  13.159  1.00  92.09  C
ATOM  993  C   SER  A  125  -17.220  -13.976  13.833  1.00  92.09  C
ATOM  994  CB  SER  A  125  -17.117  -12.481  11.840  1.00  92.09  C
ATOM  995  O   SER  A  125  -17.538  -15.108  13.459  1.00  92.09  O
ATOM  996  OG  SER  A  125  -17.831  -11.523  11.084  1.00  92.09  O

column -t is used only to display the output as a table.
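
If the reformatting caused by assigning to a field is a problem, awk can also edit by character position instead of by field. The following is only a sketch, assuming the residue key always sits in columns 18-26 and the value to overwrite in columns 62-66, as in the sample lines; the demo file names and the keys array are hypothetical:

```shell
# Throwaway sample data (not the real files), written to a temp directory.
workdir=$(mktemp -d)
cat > "$workdir/demo_file1" <<'EOF'
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
EOF
cat > "$workdir/demo_file2" <<'EOF'
GLU A   2
GLY A 124
EOF

result=$(awk '
    FNR == NR { keys[$0]; next }      # first file: remember each residue key
    substr($0, 18, 9) in keys {       # columns 18-26 of the pdb line
        # rebuild the line from fixed character ranges, so spacing is kept
        $0 = substr($0, 1, 61) "00.00" substr($0, 67)
    }
    1                                 # print every line, edited or not
' "$workdir/demo_file2" "$workdir/demo_file1")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because the whole record is rebuilt from substr pieces, no output format has to be specified and column -t is not needed. Lines of the key file with trailing blanks would not match, so those would have to be trimmed first.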

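Using sed inside a while loop that reads file2 line by line, you can target only the lines that match a line found in file2 and perform the substitution on those lines: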

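s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/ - the first group captures everything up to the last number on the line that matches the NN.NN pattern and is returned with back-reference \1. The matched digits themselves are left out of both groups, and the whitespace plus everything after it is captured again and returned with back-reference \2.
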
$ cat file1
ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 37.48           C
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 37.48           C
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 37.48           C
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 37.48           O
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 37.48           C
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 90.37           C
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 90.37           C
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 90.37           O
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O
$ while read -r line; do sed -i.bak "/$line/s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/\100.00\2/" file1; done < file2
$ cat file1
ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 00.00           N
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 00.00           C
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 00.00           C
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 00.00           C
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 00.00           O
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 00.00           C
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 00.00           N
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 00.00           C
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 00.00           C
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 00.00           O
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O
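
The same edit can also be done in a single pass, without a shell loop and without -i. Note that a loop which runs sed on the original file and redirects to the same output file on every iteration keeps at most the last pattern's edits. A possible sketch, assuming the lines of the key file contain no slashes or regex metacharacters and that the value to overwrite sits in columns 62-66; demo_file1, demo_file2 and the script variable are hypothetical names:

```shell
# Throwaway sample data (not the real files), written to a temp directory.
workdir=$(mktemp -d)
cat > "$workdir/demo_file1" <<'EOF'
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
EOF
cat > "$workdir/demo_file2" <<'EOF'
GLU A   2
GLY A 124
EOF

# Turn every non-empty key line into one sed command of the form
#   /GLU A   2/s/^\(.\{61\}\).\{5\}/\100.00/
# (keep the first 61 characters, overwrite the next five with 00.00).
script=$(sed -e '/^$/d' \
             -e 's|.*|/&/s/^\\(.\\{61\\}\\).\\{5\\}/\\100.00/|' \
             "$workdir/demo_file2")

# Apply all generated commands to the data file in one sed run.
result=$(sed -e "$script" "$workdir/demo_file1")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Anchoring the substitution at a character position rather than at an NN.NN pattern means a coordinate that happens to look like the target value is never touched. For hundreds of files, the generated script could be written to a file once and reused with sed -f.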