Match a specific pattern and print just the matched string on the previous line
I have updated the question with additional information.
I have a .fastq file formatted as follows:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
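The format is the same for every sequence (4 lines, repeated).
What I want to do is search, within a window of n=35 characters on line 2, for a specific regexp pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}). If it is found, I want to cut it out and report it at the end of the previous line.
So far I have written some code that almost does what I want. My idea was to use the match function on the substring covering the window I am interested in, but I have not reached my goal. My script.awk is below: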
match(substr($0,0,35),/regexp/,a) {
    print p,a[0]   #print the previous line respect to the matched one
    print          #print the current line
    for(i=0;i<=1;i++) {   # print the 2 lines following
        getline
        print
    }
}
#store previous line
{ p = $0 }
Starting from a file like this:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to get output like this:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #is the string that matched the regexp WITHOUT the initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line
Fair warning: I wanted to have some fun with this, and the result is rather convoluted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="@";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
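The second block is not updated because the window is 15 and the pattern cannot be found within that window.
I use the variable RS to handle each whole 4-line block as $0, with its lines available as $1, $2, $3 and $4. Because the input file starts with RS and does not end with RS, I preferred not to set ORS and to use printf instead of print.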
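To make the record splitting above concrete, here is a minimal sketch, assuming a well-formed FASTQ stream whose quality strings contain no "@" (real quality lines can contain that character, in which case splitting on RS="@" would cut a record in the wrong place). The function name fastq_fields and the sample reads are made up for illustration.

```shell
# Hypothetical helper: show how RS="@" and FS="\n" expose the lines of each
# 4-line FASTQ record as $1 (header), $2 (sequence), $3 ("+") and $4 (quality).
fastq_fields() {
    awk 'BEGIN { RS = "@"; FS = "\n" }
         NR > 1 {                 # record 1 is the empty text before the first "@"
             print "header=" $1   # the leading "@" was consumed as the separator
             print "seq=" $2
             print "qual=" $4
         }'
}

# Usage with two made-up reads:
printf '@read1 demo\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n' | fastq_fields
```

Because the "@" is consumed as the record separator, the script above has to print RS back in front of $0 when it writes each record out.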
$ cat tst.awk
BEGIN {
    tgtStr   = "pattern"
    tgtLgth  = length(tgtStr)
    winLgth  = 35
    numLines = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
        rec[1] = rec[1] " " tgtStr
        rec[2] = substr(rec[2],idx+tgtLgth)
        rec[4] = substr(rec[4],idx+tgtLgth)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
$ awk -f tst.awk file
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
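The script buffers lines using the expression ( (NR-1) % numLines ) + 1, which cycles through 1, 2, 3, 4 as NR grows. A small sketch of just that counter, with a hypothetical helper name (label_lines) and made-up input:

```shell
# Hypothetical helper: prefix every input line with its position (1..4)
# inside the current 4-line record, using the same modular counter.
label_lines() {
    awk -v numLines=4 '{
        lineNr = ((NR - 1) % numLines) + 1
        print lineNr ": " $0
    }'
}

# Usage with two made-up reads:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n' | label_lines
```

When lineNr reaches numLines the whole record is in rec[], so that is the point at which the sequence and quality lines can be edited together.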
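Regarding the code you posted:
substr($0,0,35) - strings, fields, line numbers and arrays in awk start at 1, not 0, so this should be substr($0,1,35). awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case, but get into the habit of starting everything at 1 to avoid errors when it matters.
for(i=0;i<=1;i++) - for the same reason this should be for(i=1;i<=2;i++).
getline - inappropriate use and fragile syntax.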
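As a small illustration of 1-based indexing, the sketch below extracts the first w characters of a string with substr(s, 1, w). The helper name window_prefix and its arguments are made up for this example.

```shell
# Hypothetical helper: print the first w characters of a string.
# $1 = string, $2 = window length. awk counts characters from 1.
window_prefix() {
    awk -v s="$1" -v w="$2" 'BEGIN { print substr(s, 1, w) }'
}

# Usage: the first 7 characters of a made-up sequence.
window_prefix "CATCTACATATTCACATATAG" 7
```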
Update - based on your comment below, pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
    tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
    winLgth   = 35
    numLines  = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
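The regexp version relies on match() setting RSTART and RLENGTH, which are then reused to trim the sequence and the quality string by the same amount. Here is a minimal sketch of that mechanism on made-up strings. The helper name trim_after_match is hypothetical, and the example uses a simple regexp (AC+A) instead of the {5,} intervals, since some older awk builds do not enable interval expressions by default.

```shell
# Hypothetical helper: report the matched text, then the sequence and quality
# strings with everything up to the end of the match removed.
# $1 = sequence, $2 = quality, $3 = regexp (all made-up inputs).
trim_after_match() {
    awk -v seq="$1" -v qual="$2" -v re="$3" 'BEGIN {
        if (match(seq, re)) {
            # RSTART is the 1-based start of the match, RLENGTH its length
            print "matched=" substr(seq, RSTART, RLENGTH)
            print "seq=" substr(seq, RSTART + RLENGTH)
            print "qual=" substr(qual, RSTART + RLENGTH)
        } else {
            print "no match"
        }
    }'
}

# Usage: "ACCCA" starts at position 3 and is 5 characters long.
trim_after_match "GGACCCATT" "ABCDEFGHI" "AC+A"
```

Since match() returns the leftmost-longest match, a regexp that begins with [A-Z]{5,} will include any leading letters it is able to absorb.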