Extracting ID and sequence from a FASTQ file
I am trying to manipulate a FASTQ file.
It looks like this:
@HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B@=4A9@:6@96:1??9;>##########
@HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA@B7BB:7?@####################
@HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=@CB@?BA/@BA;6>BBA8A6A<?A4?B=
...
...
...
@HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B@BA@AB@B:74:;8=>7
My expected output is:
@HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
@HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
@HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC
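So the ID lines are the ones that start with @HWUSI (e.g. @HWUSI-EAS610:1:1:7:1951#0/1). Each ID is followed by a line containing its sequence.
Now I want a file that contains only each ID and its corresponding sequence, and the sequence should be reversed and complemented (A=T, T=A, C=G, G=C).
With sed I can get the reverse complement of all the sequences with this command: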
sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC
How can I get the corresponding IDs as well?
With sed:
sed -n '/@HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
It works as follows:
/@HWUSI/ { # If a line contains @HWUSI (i.e. it is an ID line)
p # print it
s/.*// # empty the pattern space
N # fetch the sequence line. It is now preceded
# by a newline in the pattern space. That is
# going to be our cursor
:a # jump label for looping
/\n$/! { # while the cursor has not arrived at the end
s/\n\(.*\)\(.\)/\2\n\1/ # move the last character before the cursor
ba # go back to a. This loop reverses the sequence
}
y/ATCG/TAGC/ # then complement the bases
p # and print it.
}
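I intentionally left the newline in there for readability; if it is not wanted, replace the last p with P (uppercase instead of lowercase). p prints the whole pattern space, P prints only the part up to the first newline.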
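For reference, the same loop can be laid out as a multi-line script inside a shell function. This is only a sketch: it assumes GNU sed and four-line FASTQ records, the function name revcomp_ids is made up for illustration, the ID pattern is anchored with ^, and it uses P so that no blank line follows each sequence.

```shell
# Hypothetical helper (name is an assumption, not part of any tool):
# print each @HWUSI ID line followed by the reverse complement of its read.
# Assumes GNU sed and 4-line FASTQ records.
revcomp_ids() {
    sed -n '
/^@HWUSI/ {
    # print the ID line, then clear the pattern space
    p
    s/.*//
    # append the sequence line; the newline in front of it is the cursor
    N
    :a
    /\n$/! {
        # move the last character to just before the cursor
        s/\n\(.*\)\(.\)/\2\n\1/
        ba
    }
    # complement the reversed sequence (N is left untouched)
    y/ATCG/TAGC/
    # print only the part before the newline
    P
}' "$@"
}
```

Calling it as revcomp_ids MYFILE.fq (or piping a file into it) should print ID and sequence pairs in the layout asked for in the question.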
$ sed -n '/^[^@]/y/ATCG/TAGC/;/^@/p;/^[ATCGN]*$/p' file
@HWUSI-EAS610:1:1:3:1131#0/1
CTACGATTCGGGGATTCCAGTATTCTGACNNTNCAG
@HWUSI-EAS610:1:1:3:888#0/1
CTATCCTGGTTTGTAGATTGTAGAAGGGCNNCNAAG
@HWUSI-EAS610:1:1:4:941#0/1
CGAATCCTTCCTTCCTTCCTTCCCCACAAGACATCA
@HWUSI-EAS610:1:1:7:1951#0/1
ACTATCTATTCACGGATGGACGAATGCAATGAGAGG
Explanation (note that this command complements the bases but does not reverse them):
/^[^@]/y/ATCG/TAGC/ # Translate bases on lines that don't start with an @
/^@/p # Print IDs
/^[ATCGN]*$/p # Print sequence lines
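One caveat with matching on what a line starts with: a quality line can itself begin with @, because @ is a valid quality character. If that is a concern, the lines can be picked by their position in the record instead. Below is a rough sketch using awk rather than sed, assuming strict four-line records (no wrapped sequences); the function name fastq_revcomp is hypothetical.

```shell
# Hypothetical helper (name is an assumption): keep line 1 (ID) and
# line 2 (sequence) of every 4-line FASTQ record, and print the
# sequence reversed and complemented.
fastq_revcomp() {
    awk '
    BEGIN {
        # complement table; any other character (e.g. N) is kept as is
        comp["A"] = "T"; comp["T"] = "A"
        comp["C"] = "G"; comp["G"] = "C"
    }
    NR % 4 == 1 { print }                # ID line: print unchanged
    NR % 4 == 2 {                        # sequence line
        out = ""
        # walk the sequence from the last base to the first
        for (i = length($0); i >= 1; i--) {
            base = substr($0, i, 1)
            if (base in comp)
                out = out comp[base]
            else
                out = out base
        }
        print out
    }' "$@"
}
```

Used as fastq_revcomp MYFILE.fq, this skips the + and quality lines purely by line number, so their content never matters.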