将文本添加到 fasta 文件的 header
add text to a header of a fasta file
我有一个基因组的 fasta 文件 (txt),类似于:
$ cat Strain-01.faa
>IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
我想在 file.txt.
的列表中添加一个额外的 ID
$ cat file.txt
ID Gene Strain-01 Strain-02 Strain-03
ID_01 pphB IMEHDJCA_03186 DIBHEKPI_01648 LLMDBGDK_00598
ID_02 group_1001 IMEHDJCA_03187 DIBHEKPI_01635 LLMDBGDK_00611
ID_03 group_1002 IMEHDJCA_03189 DIBHEKPI_01628 LLMDBGDK_00616
例如fasta Strain-01.faa
文件有IMEHDJCA_03186
id对应于Strain-01
,所以我想添加列ID的ID_01
号(从 file.txt
) 到序列的 header,类似于:
ID_01
对应IMEHDJCA_03186
ID_02
对应IMEHDJCA_03187
ID_03
对应IMEHDJCA_03189
结果会是这样的:
$cat Strain-01_edited.faa
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
我只是想在fasta文件的header中添加一个file.txt
的ID码
有什么想法吗? bash
或 R
或任何其他方式 ?
非常感谢
更新 #2: 通过 awk
:
消除特定菌株(名称)处理
- 我们会将所有可能的 strain/ID 映射加载到
awk
- 这将允许处理任何
*.faa
文件而无需知道菌株名称
- 这将允许处理
*.faa
混合菌株的文件(不知道这是否是 OP 必须解决的问题)
- 降低了
awk
代码的复杂性(与 UPDATE #1 相比),但代价是额外的内存用于更多 id[]
数组条目
样本数据(第一个字段中的菌株混合):
# for this (nonsensical?) file the first 3 blocks include a strain
# from each of the 3 columns (of strain names) from file.txt; the
# 4th block contains a nonsensical strain that doesn't exist in
# file.txt (ie, 4th line should not see an insertion of a ID value)
$ cat Strain-mixed.faa
>IMEHDJCA_03186 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>LLMDBGDK_00616 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
>NO_MATCH hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
用于将所有菌株加载到 id[]
数组中的新 awk
代码:
awk '
NR==1 { next } # skip 1st line of 1st file
FNR==NR { for (i=3; i<=NF; i++) # for rest of 1st file load id[] with ...
id[$i]= # all strain/ID combos
next
}
/^>/ { # for 2nd file, if 1st column is ">"
ndx=substr(,2) # strip off ">"
if ( ndx in id ) # if 1st field (sans ">") is an index in id[] then ...
( =">" id[ndx] " " ndx ) # rewrite 1st field to include our id[] value
}
1 # print current line (of 2nd file)
' file.txt Strain-mixed.faa
这会生成:
>ID_01 IMEHDJCA_03186 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_02 DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_03 LLMDBGDK_00616 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
>NO_MATCH hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
注意: 此最新更新将为 Strain-{01,02}.faa
文件中的所有行执行 ID 插入(参见 UPDATE #1, 下面)。
更新 #1: 扩展原始答案以解决(我认为)Paul Hodges 关于概括答案以支持可变菌株名称的问题:
- 动态确定使用来自
file.txt
的哪一列菌株
- 动态处理匹配的
<strain>.faa
文件
示例数据:
$ cat Strain-01.faa
>IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
# for this next file I simply copied data from OP's Strain-01.faa and
# modified the initial field for blocks 1 & 3; net result is we should
# see 2 of the blocks receive insertions of ID values
$ cat Strain-02.faa
>DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>DIBHEKPI_01648 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
$ cat Strain-XX.faa
cat: Strain-XX.faa: No such file or directory
对原始 awk
答案进行了一些修改,并包装在 (bash
) for
循环中以处理不同的菌株:
for strain in Strain-01 Strain-02 Strain-XX
do
printf "\n############### ${strain} / ${strain}.faa\n\n"
awk -v strain="${strain}" ' # pass bash variable in as awk variable (same name)
NR==1 { for (i=3; i<=NF; i++) # 1st row of 1st file, look for matching strain name
{ if ( $i == strain ) # if we find a match then ...
{ strain_ndx=i # make note of the column and ...
next # skip to next line from 1st file
}
}
# if we got here we did not find a matching strain name so
# print a message and exit from our awk script
print "Unable to locate entry for "strain" in "FILENAME". Aborting."
exit
}
FNR==NR { id[$(strain_ndx)]= # for rest of 1st file build array of ids
next
}
/^>/ { # for 2nd file, if 1st column is ">"
ndx=substr(,2) # strip off ">"
if ( ndx in id ) # if 1st field (sans ">") is an index in id[] then ...
( =">" id[ndx] " " ndx ) # rewrite 1st field to include our id[] value
}
1 # print current line (of 2nd file)
' file.txt "${strain}.faa"
done
这会生成:
############### Strain-01 / Strain-01.faa
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
############### Strain-02 / Strain-02.faa
>ID_02 DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_01 DIBHEKPI_01648 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
############### Strain-XX / Strain-XX.faa
Unable to locate entry for Strain-XX in file.txt. Aborting.
原始答案
一个awk
想法:
awk '
FNR==NR { id[]= ; next } # for 1st file build array of ids
/^>/ { # for 2nd file, if 1st column is ">"
ndx=substr(,2) # strip off ">"
if ( ndx in id ) # if 1st field (sans ">") is an index in id[] then ...
( =">" id[ndx] " " ndx ) # rewrite 1st field to include our id[] value
}
1 # print current line (of 2nd file)
' file.txt fasta.dat
对于给定的样本数据,这会生成:
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
R 方式要长一些,但如果您想在 REPL 中查看自己在做什么,这可能是一个不错的选择。
# read in file and split into groups, each starting with a line like "> ..."
strain <- readLines(con = 'Strain-01.faa')
strain <- split(strain, cumsum(grepl('^>', strain)))
# extract ids from line 1 of each group
ids <- sapply(strain, function(x) gsub('^>(\w+).*', '\1', x[1]))
# read in ID lookup table and match to the extracted IDs
id_lkp <- read.table('file.txt', header = TRUE)
id_num <- with(id_lkp, ID[match(ids, Strain.01)])
# for each group, append the id after > to the first line
for(i in seq_along(strain)){
strain[[i]][1] <- sub('^>(\w+)', paste0('>', id_num[i], ' \1'), strain[[i]][1])
}
# write to output file
writeLines(unlist(strain), file('output.txt'))
由 reprex package (v2.0.0)
于 2021-07-19 创建
现在 output.txt 看起来像:
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
我有一个基因组的 fasta 文件 (txt),类似于:
$ cat Strain-01.faa
>IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
我想在 file.txt.
的列表中添加一个额外的 ID$ cat file.txt
ID Gene Strain-01 Strain-02 Strain-03
ID_01 pphB IMEHDJCA_03186 DIBHEKPI_01648 LLMDBGDK_00598
ID_02 group_1001 IMEHDJCA_03187 DIBHEKPI_01635 LLMDBGDK_00611
ID_03 group_1002 IMEHDJCA_03189 DIBHEKPI_01628 LLMDBGDK_00616
例如fasta Strain-01.faa
文件有IMEHDJCA_03186
id对应于Strain-01
,所以我想添加列ID的ID_01
号(从 file.txt
) 到序列的 header,类似于:
ID_01
对应IMEHDJCA_03186
ID_02
对应IMEHDJCA_03187
ID_03
对应IMEHDJCA_03189
结果会是这样的:
$cat Strain-01_edited.faa
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
我只是想在fasta文件的header中添加一个file.txt
的ID码
有什么想法吗? bash
或 R
或任何其他方式 ?
非常感谢
更新 #2: 通过 awk
:
- 我们会将所有可能的 strain/ID 映射加载到
awk
- 这将允许处理任何
*.faa
文件而无需知道菌株名称 - 这将允许处理
*.faa
混合菌株的文件(不知道这是否是 OP 必须解决的问题) - 降低了
awk
代码的复杂性(与 UPDATE #1 相比),但代价是额外的内存用于更多id[]
数组条目
样本数据(第一个字段中的菌株混合):
# for this (nonsensical?) file the first 3 blocks include a strain
# from each of the 3 columns (of strain names) from file.txt; the
# 4th block contains a nonsensical strain that doesn't exist in
# file.txt (ie, 4th line should not see an insertion of a ID value)
$ cat Strain-mixed.faa
>IMEHDJCA_03186 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>LLMDBGDK_00616 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
>NO_MATCH hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
用于将所有菌株加载到 id[]
数组中的新 awk
代码:
awk '
NR==1 { next } # skip 1st line of 1st file
FNR==NR { for (i=3; i<=NF; i++) # for rest of 1st file load id[] with ...
id[$i]= # all strain/ID combos
next
}
/^>/ { # for 2nd file, if 1st column is ">"
ndx=substr(,2) # strip off ">"
if ( ndx in id ) # if 1st field (sans ">") is an index in id[] then ...
( =">" id[ndx] " " ndx ) # rewrite 1st field to include our id[] value
}
1 # print current line (of 2nd file)
' file.txt Strain-mixed.faa
这会生成:
>ID_01 IMEHDJCA_03186 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_02 DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_03 LLMDBGDK_00616 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
>NO_MATCH hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
注意: 此最新更新将为 Strain-{01,02}.faa
文件中的所有行执行 ID 插入(参见 UPDATE #1, 下面)。
更新 #1: 扩展原始答案以解决(我认为)Paul Hodges 关于概括答案以支持可变菌株名称的问题:
- 动态确定使用来自
file.txt
的哪一列菌株
- 动态处理匹配的
<strain>.faa
文件
示例数据:
$ cat Strain-01.faa
>IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
# for this next file I simply copied data from OP's Strain-01.faa and
# modified the initial field for blocks 1 & 3; net result is we should
# see 2 of the blocks receive insertions of ID values
$ cat Strain-02.faa
>DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>DIBHEKPI_01648 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
$ cat Strain-XX.faa
cat: Strain-XX.faa: No such file or directory
对原始 awk
答案进行了一些修改,并包装在 (bash
) for
循环中以处理不同的菌株:
for strain in Strain-01 Strain-02 Strain-XX
do
printf "\n############### ${strain} / ${strain}.faa\n\n"
awk -v strain="${strain}" ' # pass bash variable in as awk variable (same name)
NR==1 { for (i=3; i<=NF; i++) # 1st row of 1st file, look for matching strain name
{ if ( $i == strain ) # if we find a match then ...
{ strain_ndx=i # make note of the column and ...
next # skip to next line from 1st file
}
}
# if we got here we did not find a matching strain name so
# print a message and exit from our awk script
print "Unable to locate entry for "strain" in "FILENAME". Aborting."
exit
}
FNR==NR { id[$(strain_ndx)]= # for rest of 1st file build array of ids
next
}
/^>/ { # for 2nd file, if 1st column is ">"
ndx=substr(,2) # strip off ">"
if ( ndx in id ) # if 1st field (sans ">") is an index in id[] then ...
( =">" id[ndx] " " ndx ) # rewrite 1st field to include our id[] value
}
1 # print current line (of 2nd file)
' file.txt "${strain}.faa"
done
这会生成:
############### Strain-01 / Strain-01.faa
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
############### Strain-02 / Strain-02.faa
>ID_02 DIBHEKPI_01635 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_01 DIBHEKPI_01648 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
############### Strain-XX / Strain-XX.faa
Unable to locate entry for Strain-XX in file.txt. Aborting.
原始答案
一个awk
想法:
awk '
FNR==NR { id[]= ; next } # for 1st file build array of ids
/^>/ { # for 2nd file, if 1st column is ">"
ndx=substr(,2) # strip off ">"
if ( ndx in id ) # if 1st field (sans ">") is an index in id[] then ...
( =">" id[ndx] " " ndx ) # rewrite 1st field to include our id[] value
}
1 # print current line (of 2nd file)
' file.txt fasta.dat
对于给定的样本数据,这会生成:
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD
R 方式要长一些,但如果您想在 REPL 中查看自己在做什么,这可能是一个不错的选择。
# read in file and split into groups, each starting with a line like "> ..."
strain <- readLines(con = 'Strain-01.faa')
strain <- split(strain, cumsum(grepl('^>', strain)))
# extract ids from line 1 of each group
ids <- sapply(strain, function(x) gsub('^>(\w+).*', '\1', x[1]))
# read in ID lookup table and match to the extracted IDs
id_lkp <- read.table('file.txt', header = TRUE)
id_num <- with(id_lkp, ID[match(ids, Strain.01)])
# for each group, append the id after > to the first line
for(i in seq_along(strain)){
strain[[i]][1] <- sub('^>(\w+)', paste0('>', id_num[i], ' \1'), strain[[i]][1])
}
# write to output file
writeLines(unlist(strain), file('output.txt'))
由 reprex package (v2.0.0)
于 2021-07-19 创建现在 output.txt 看起来像:
>ID_01 IMEHDJCA_03186 Serine/threonine-protein phosphatase 2
MEFKHRFIDGSRYQRIFVIGDIHGKLALLQDTLKRVDFHGERDLLISVGDLIDRGPDSVG
VLDYYQTHDWFEAVMGNHEWMMVNALDAQNKLERSEKEAYFIKIWHRNGCEWSQNL
>ID_02 IMEHDJCA_03187 Serine transporter
MKESRETLNFSDTLPTETWTKHDTHWVLSLFGTAVGAGILFLPINLGIGGFWPLVLLALL
AFPMTFWGHRALARFVLSSKQADADFTDVVEEHFGAKAGRLISLLYFLSIFPILLIYGVG
>ID_03 IMEHDJCA_03189 hypothetical protein
MNNQRHGITFGIERIGSQTILVFKATGTLTHQDYQAIAPVLEAALAGINRQQMNMLADIS
EFSGWEPRAAWDDFQLGLKIGFSVNKVAVYGDKNWQELAAKVGSWFISGEMKSFGD