Using grep (or awk) to extract two consecutive words
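I have a BAM file from which I need to extract two words: the one starting with "PL:Z..." and the one right after it, starting with "PR:Z...".
I started by trying to get the first word, but had no luck: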
samtools view -h file1.bam | grep -o '\<PR[[:alnum:]]+\>'
Extracting columns with awk would be easier, but I noticed that the column numbers of PL and PR are not the same on every line of the file:
awk -v OFS='\t' '{print , }'
Test file with the first 3 lines:
MN01111:72:000H3TTKV:1:13108:10015:1913 2689 SL3.0ch00 8990677 0 59H40M52H SL3.0ch01 5122725 0 TTTTTTTTTTTTTTATTATTTTTTTTTATTTTTTTTTTTT AFFFF/FF/FFFF//FF////A/FFFF///F/FF////F/ NM:i:2 MD:Z:5A24A9 MC:Z:122M29H AS:i:30 XS:i:28 SA:Z:SL3.0ch09,55182541,-,78S31M42S,0,0; XA:Z:SL3.0ch05,+4984944,78S33M40S,1;SL3.0ch09,-70510420,47S27M77S,0;SL3.0ch02,-52101716,44S37M70S,2;SL3.0ch08,+62573290,63S25M63S,0; bl:Z:CGATGT br:Z:TTTGTC bm:Z:0 PL:Z:SL3.0ch01_5122724_5122846_FW PR:Z:None RG:Z:000H3TTKV_1_BSPT19472_0
MN01111:72:000H3TTKV:1:23103:5003:15527 641 SL3.0ch00 8990677 19 67S40M44S SL3.0ch01 838549 0 CCGCTCCCCCGATCCCTTCCACCCGGTCCTTATTTTTTTTTTTTTTTTTTTTTTTTTTTATATTTTTTTTTTATTTTTTTTATTATTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTATTTTTTTTTTTTTTTCTTTTATATTTTTGCCC ////////6=///=///////==////////F=//FA/F/F/6//6FFF/FFFF=/F//F///FF/FAFF///F//FFF/F/FF//FF/FAAAFAFFAFA///AFFFFFF/FFAF/A///6/F///F///6/F////FF////FF///FFF NM:i:1 MD:Z:30A9 MC:Z:105S35M11S AS:i:35 XS:i:31 SA:Z:SL3.0ch02,46044972,+,28S31M92S,0,0; XA:Z:SL3.0ch09,-70510416,35S31M85S,0; bl:Z:ATCACG br:Z:GTGCCT bm:Z:0 PL:Z:SL3.0ch05_3501697_3501846_FW PR:Z:None RG:Z:000H3TTKV_1_Fimande_0
MN01111:72:000H3TTKV:1:23110:15540:17389 2689 SL3.0ch00 8990677 0 10H40M101H SL3.0ch02 39003136 0 TTTTTTTTTTTTTTTTTATTTTTTTTTATTATTTTTTTTT F==AFFFA6FAF//F////A/F/F=///////////A/FA NM:i:2 MD:Z:5A8A25 MC:Z:151M AS:i:30 XS:i:29 SA:Z:SL3.0ch03,30054271,+,44S32M75S,0,0;SL3.0ch12,17846152,-,40S30M81S,0,0; bl:Z:ATCACG br:Z:ACCATG bm:Z:0 PL:Z:SL3.0ch02_39003135_39003329_FW PR:Z:None RG:Z:000H3TTKV_1_Martyvel_0
Expected output:
PL:Z:SL3.0ch01_5122724_5122846_FW PR:Z:None
PL:Z:SL3.0ch05_3501697_3501846_FW PR:Z:None
PL:Z:SL3.0ch02_39003135_39003329_FW PR:Z:None
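You can use this awk. It loops through all fields, tests each one against the regex ^P[LR]:Z:, and appends every matching field to a variable that is printed at the end of the record.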
awk -v OFS='\t' '
{
s = ""
for (i=1; i<=NF; ++i)
if ($i ~ /^P[LR]:Z:/)
s = (s ? s OFS : "") $i
print s
}' file
PL:Z:SL3.0ch01_5122724_5122846_FW PR:Z:None
PL:Z:SL3.0ch05_3501697_3501846_FW PR:Z:None
PL:Z:SL3.0ch02_39003135_39003329_FW PR:Z:None
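One thing worth noting about the loop above: as written it runs `print s` for every input line, so a line with no PL/PR field (for example the `@` header lines that `samtools view -h` emits) comes out as an empty line. Below is a minimal sketch of the same loop with a guard against that. The function name `extract_tags` and the shortened sample records are made up for illustration; they are not from the question.

```shell
# Hypothetical helper: print only the PL:Z:/PR:Z: fields of each record on stdin.
extract_tags() {
  awk -v OFS='\t' '
  {
    s = ""
    for (i = 1; i <= NF; ++i)
      if ($i ~ /^P[LR]:Z:/)
        s = (s ? s OFS : "") $i
    if (s != "") print s    # skip lines (such as @ headers) with no PL/PR field
  }'
}

# Made-up, shortened records; the tags sit in different columns on each line.
printf '%s\n' \
  '@HD VN:1.6' \
  'r1 0 chr1 NM:i:2 PL:Z:chr1_10_20_FW PR:Z:None RG:Z:a' \
  'r2 0 chr1 PL:Z:chr5_30_40_FW PR:Z:None RG:Z:b' | extract_tags
```

With real data this would be used as `samtools view -h file1.bam | extract_tags`. The two fields are joined with a tab because of `OFS='\t'`.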
I would use the string functions match and substr for this task:
samtools view -h file1.bam | awk 'match($0,/PL:Z:.*PR:Z:[^[:space:]]+/){print substr($0,RSTART,RLENGTH)}'
This gives
PL:Z:SL3.0ch01_5122724_5122846_FW PR:Z:None
PL:Z:SL3.0ch05_3501697_3501846_FW PR:Z:None
PL:Z:SL3.0ch02_39003135_39003329_FW PR:Z:None
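Explanation: match looks for PL:Z: followed by zero or more (*) of any character (.), followed by PR:Z:, followed by one or more (+) characters that are not (^) whitespace ([:space:]). If there is a match, print the substring that starts where the match starts and is as long as the match, or, simply put, print what was matched.
(Tested in gawk 4.2.1)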
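Because `.*` is greedy, the pattern above would also span any other fields that happened to sit between PL:Z: and PR:Z:. If the two tags are always adjacent, as in the sample lines, a stricter pattern that allows only whitespace between them may be safer. This is a sketch under that assumption; the function name `extract_pair` and the sample record are hypothetical.

```shell
# Hypothetical helper: print "PL:Z:... PR:Z:..." only when the two fields are adjacent.
extract_pair() {
  awk 'match($0, /PL:Z:[^[:space:]]+[[:space:]]+PR:Z:[^[:space:]]+/) {
    # RSTART and RLENGTH are set by match(); substr() returns exactly the matched text
    print substr($0, RSTART, RLENGTH)
  }'
}

# Made-up, shortened record for illustration.
printf '%s\n' 'r1 0 chr1 NM:i:2 PL:Z:chr1_10_20_FW PR:Z:None RG:Z:a' | extract_pair
```

The separator between the two fields is printed as it appears in the input, so with real SAM records it would be a tab.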
If sed is an option, it can do that kind of substitution like this:
samtools view -h file1.bam | sed 's/.*\(PL:Z:.*PR:Z:\w*\).*/\1/g'
Output:
PL:Z:SL3.0ch01_5122724_5122846_FW PR:Z:None
PL:Z:SL3.0ch05_3501697_3501846_FW PR:Z:None
PL:Z:SL3.0ch02_39003135_39003329_FW PR:Z:None
Explanation:
s/pattern/replacement/g replaces every occurrence of the pattern on each line.
The pattern is:
.* = any characters (except newlines)
\( = start of capture group 1
PL:Z: = literal characters
.* = any characters
PR:Z: = literal characters
\w* = any word characters (letters, digits, underscore)
\) = end of capture group 1
.* = any characters
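The replacement is:
\1 = what was captured in capture group 1 of the pattern.
Note that this simple version also prints, unchanged, every complete line that does not contain both PL:Z: and PR:Z:.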
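To drop the non-matching lines as well, one option is to run sed with `-n` and add the `p` flag to the substitution, so that only lines where the substitution succeeded are printed. The sketch below also swaps `\w*` for `[^[:space:]]*`: `\w` does not match a dot, so a PR value such as SL3.0ch01_... would probably be cut off at the first dot. The function name `extract_pair_sed` and the sample records are hypothetical.

```shell
# Hypothetical helper: keep only the "PL:Z:... PR:Z:..." part of matching lines.
extract_pair_sed() {
  # -n suppresses automatic printing; the trailing p prints only substituted lines
  sed -n 's/.*\(PL:Z:.*PR:Z:[^[:space:]]*\).*/\1/p'
}

# Made-up records: a header line (dropped) and a read whose PR value contains a dot.
printf '%s\n' \
  '@HD VN:1.6' \
  'r1 0 chr1 PL:Z:chr1_10_20_FW PR:Z:chr1.5_30_40_RV RG:Z:a' | extract_pair_sed
```

With real data this would be used as `samtools view -h file1.bam | extract_pair_sed`.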