根据另一个文件中的逗号分隔字符串提取行并将提取的行写入文件

Extracting lines based on comma-separated string in another file and write extracted lines to file

我有一个制表符分隔的文件,在第一列中列出了 genomeID 及其相应的 contigIDs。 contigID 在第二列中以逗号分隔(如下例)

424182.1        H|S1|C933685,H|S1|C449562,H|S1|C172291,H|S1|C1169825
1217675.1       H|S1|C1168525,H|S1|C573086,H|S1|C357867,H|S1|C85072,H|S1|C965427,H|S1|C1724718
585503.1        H|S1|C874141,H|S1|C529585

我有另一个名为 SAMPLE.fasta 的文件,其中包含 contigID 和下一行中每个 contigID 的相应序列(如下例)

>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
etc...
etc...
etc..

根据这些信息,我想为每个基因组 ID 创建一个 separate file(下面的示例)

Output_file: 424182.1.fasta

>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT

Output_file: 1217675.1.fasta

>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG

如有任何帮助,我们将不胜感激 awk and/or pandas。预先感谢您对此的帮助!

尝试:

import pandas as pd

# STEP-1: load sample data and create a Series
data = {}
with open('SAMPLE.fasta') as fp:
    for line in fp:
        if line.startswith('>'):
            id_ = line[1:].strip()
        else:
            data[id_] = line.strip()
sr = pd.Series(data)

# STEP-2: load the list of genome id and create a DataFrame
df = pd.read_table('data.tsv', header=None, names=['genomeID', 'contigIDs'])
df = df.assign(contigIDs=df['contigIDs'].str.split(',')).explode('contigIDs')

# STEP-3: map your series with your dataframe
df = df.assign(Seq=df['contigIDs'].map(sr)).dropna()

# STEP-4: create your files
for filename, df1 in df.groupby('genomeID'):
    with open(f"{filename}.fasta", 'w') as fp:
        for _, row in df1.iterrows():
            fp.write(f">{row['contigIDs']}\n{row['Seq']}\n")

输出:

# Content of 424182.1.fasta
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT

# Content of 1217675.1.fasta
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG

假设序列数据以单行结尾(没有延伸到多行),awk 解决方案如何:

awk -F'\t' '
    NR==FNR {                                   # process SAMPLE.fasta file
        if (FNR % 2) {                          # odd line with contigID
            len = split([=10=], a, "|")             # extract the contigID
            id = a[len]
            seq[id] = [=10=]                        # assign seq[id] to the line
        } else {                                # even line with sequence
            seq[id] = seq[id] RS [=10=]             # append sequence to seq[id]
        }
        next
    }
    {                                           # process contigIDs file
        fname =  ".fasta"                     # filename to write
        len = split(, a, ",")                 # split the contigIDs
        for (i = 1; i <= len; i++) {
            split(a[i], b, "|")                 # extract the contigID
            if (b[3] in seq) {                  # if the sequence is found
                print seq[b[3]] > fname         # then print it to the file
            }
        }
        close(fname)
    }
' SAMPLE.fasta contigIDs

输出:

424182.1.fasta file:
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT

1217675.1.fasta file:
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG

我的一位同事使用 biopython and pandas 提供了一个解决方案,因此为了完整起见,我将其张贴在这里..

from Bio import SeqIO   # require biopython>=1.7.9
import pandas as pd 

genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None, names=['genomeID', 'contigID'])

for i, r in genomes_l.iterrows():
    genome_name = r['genomeID']
    contig_ids = r['contigID'].split(',')
    genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in contig_ids]
    with open(f'out_dir/{genome_name}.fasta', 'w') as handle:
        SeqIO.write(genome_contigs, handle, 'fasta')

谢谢大家的帮助!