Matching and appending strings to headers
I want to append a string to the sequence headers in a FASTA file.
Input:
>uce-101_seqname
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
Desired output:
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
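Sample code:
awk -F ">" '{if($2 ~ /^uce/){print $0 " |" substr($2,1,7)} else {print $0}}' <inputfile>
The sample code only works when the ID is exactly 7 characters long (e.g. uce-101). I need it to work for IDs both longer and shorter than 7 characters (e.g. uce-1, uce-10, uce-1001).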
I think shellter has hit the nail on the head with his comment above. With that, your line of awk can be simplified to:
awk -F '>' '$2~/^uce/ { x=$2; sub(/_.*/,"",x); print $0, "|" x; next }1' file
Results:
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
However, if you would prefer a sed solution, you could try:
sed '/^>uce/s/>\([^_]*\).*/& |\1/' file
Results:
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
Explanation:
/^>uce/         # This is an address that specifies which lines are to be
                # examined or modified. In this case, only lines beginning
                # with the string '>uce' are addressed.
s/../../        # Perform a substitution using the '/' delimiter.
>\([^_]*\).*    # This is the pattern to be matched. The '>' character is a
                # literal '>'. Escaped parentheses are then used to capture
                # a character class that matches any character other than an
                # underscore, any (zero or more) number of times. All this
                # is then followed by any character any number of times.
& |\1           # This is the replacement string. The '&' character is the
                # whole pattern that was found. This is followed by a
                # literal space and a literal pipe character. '\1' is then
                # the pattern that we kept using our escaped parentheses.
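As a quick check that the capture group really handles IDs of any length, here is a minimal sketch. The function name tag_headers_sed and the sample headers are made up for illustration; only the sed expression comes from the answer above.

```shell
# Hypothetical wrapper around the sed command from the answer above.
tag_headers_sed() {
  sed '/^>uce/s/>\([^_]*\).*/& |\1/'
}

# Made-up sample input: a short ID, a long ID, and a header that is not a uce.
printf '%s\n' \
  '>uce-1_seqname' 'ACGT' \
  '>uce-1001_seqname' 'ACGT' \
  '>other_seqname' 'ACGT' |
  tag_headers_sed
```

With this input the two uce headers should come out as ">uce-1_seqname |uce-1" and ">uce-1001_seqname |uce-1001", while ">other_seqname" and the sequence lines pass through unchanged, because the address /^>uce/ does not select them.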
This should do:
awk -F">|_" 'NF>2 {$0=$0" |"$2}1' file
>uce-101_seqname |uce-101
GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA
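Set the field separator to > or _
If the line contains more than two fields, rebuild the line with the second field appended
Print all lines.
If you need to test for uce, then this should do:
awk -F">|_" '$2~/^uce/ {$0=$0" |"$2}1' file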
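The difference between the two awk variants only shows up when the file also contains headers that are not uce loci. A small sketch of that case follows. The function names and the ">abc_seqname" header are assumptions made up for illustration; the awk programs are the two from this answer.

```shell
# Made-up sample: one uce header and one non-uce header that also contains '_'.
sample() {
  printf '%s\n' '>uce-10_seqname' 'ACGT' '>abc_seqname' 'ACGT'
}

# Variant 1 (hypothetical name): tags every line that splits into more than two fields.
tag_by_field_count() {
  awk -F'>|_' 'NF>2 {$0=$0" |"$2}1'
}

# Variant 2 (hypothetical name): tags only lines whose second field starts with "uce".
tag_uce_only() {
  awk -F'>|_' '$2~/^uce/ {$0=$0" |"$2}1'
}

sample | tag_by_field_count
sample | tag_uce_only
```

The first variant should also turn ">abc_seqname" into ">abc_seqname |abc", since that line has three fields; the second leaves it alone. If every header in the file is a uce header, the two behave the same.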