根据另一个文件中的逗号分隔字符串提取行并将提取的行写入文件
Extracting lines based on comma-separated string in another file and write extracted lines to file
我有一个制表符分隔的文件,在第一列中列出了 genomeID
及其相应的 contigIDs
。 contigID 在第二列中以逗号分隔(如下例)
424182.1 H|S1|C933685,H|S1|C449562,H|S1|C172291,H|S1|C1169825
1217675.1 H|S1|C1168525,H|S1|C573086,H|S1|C357867,H|S1|C85072,H|S1|C965427,H|S1|C1724718
585503.1 H|S1|C874141,H|S1|C529585
我有另一个名为 SAMPLE.fasta
的文件,其中包含 contigID 和下一行中每个 contigID 的相应序列(如下例)
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
etc...
etc...
etc..
根据这些信息,我想为每个基因组 ID 创建一个 separate file
(下面的示例)
Output_file: 424182.1.fasta
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
Output_file: 1217675.1.fasta
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
如有任何帮助,我们将不胜感激 awk and/or pandas
。预先感谢您对此的帮助!
尝试:
import pandas as pd
# STEP-1: load sample data and create a Series
data = {}
with open('SAMPLE.fasta') as fp:
for line in fp:
if line.startswith('>'):
id_ = line[1:].strip()
else:
data[id_] = line.strip()
sr = pd.Series(data)
# STEP-2: load the list of genome id and create a DataFrame
df = pd.read_table('data.tsv', header=None, names=['genomeID', 'contigIDs'])
df = df.assign(contigIDs=df['contigIDs'].str.split(',')).explode('contigIDs')
# STEP-3: map your series with your dataframe
df = df.assign(Seq=df['contigIDs'].map(sr)).dropna()
# STEP-4: create your files
for filename, df1 in df.groupby('genomeID'):
with open(f"{filename}.fasta", 'w') as fp:
for _, row in df1.iterrows():
fp.write(f">{row['contigIDs']}\n{row['Seq']}\n")
输出:
# Content of 424182.1.fasta
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
# Content of 1217675.1.fasta
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
假设序列数据以单行结尾(没有延伸到多行),awk 解决方案如何:
awk -F'\t' '
NR==FNR { # process SAMPLE.fasta file
if (FNR % 2) { # odd line with contigID
len = split([=10=], a, "|") # extract the contigID
id = a[len]
seq[id] = [=10=] # assign seq[id] to the line
} else { # even line with sequence
seq[id] = seq[id] RS [=10=] # append sequence to seq[id]
}
next
}
{ # process contigIDs file
fname = ".fasta" # filename to write
len = split(, a, ",") # split the contigIDs
for (i = 1; i <= len; i++) {
split(a[i], b, "|") # extract the contigID
if (b[3] in seq) { # if the sequence is found
print seq[b[3]] > fname # then print it to the file
}
}
close(fname)
}
' SAMPLE.fasta contigIDs
输出:
424182.1.fasta file:
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
1217675.1.fasta file:
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
我的一位同事使用 biopython and pandas
提供了一个解决方案,因此为了完整起见,我将其张贴在这里..
from Bio import SeqIO # require biopython>=1.7.9
import pandas as pd
genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None, names=['genomeID', 'contigID'])
for i, r in genomes_l.iterrows():
genome_name = r['genomeID']
contig_ids = r['contigID'].split(',')
genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in contig_ids]
with open(f'out_dir/{genome_name}.fasta', 'w') as handle:
SeqIO.write(genome_contigs, handle, 'fasta')
谢谢大家的帮助!
我有一个制表符分隔的文件,在第一列中列出了 genomeID
及其相应的 contigIDs
。 contigID 在第二列中以逗号分隔(如下例)
424182.1 H|S1|C933685,H|S1|C449562,H|S1|C172291,H|S1|C1169825
1217675.1 H|S1|C1168525,H|S1|C573086,H|S1|C357867,H|S1|C85072,H|S1|C965427,H|S1|C1724718
585503.1 H|S1|C874141,H|S1|C529585
我有另一个名为 SAMPLE.fasta
的文件,其中包含 contigID 和下一行中每个 contigID 的相应序列(如下例)
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
etc...
etc...
etc..
根据这些信息,我想为每个基因组 ID 创建一个 separate file
(下面的示例)
Output_file: 424182.1.fasta
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
Output_file: 1217675.1.fasta
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
如有任何帮助,我们将不胜感激 awk and/or pandas
。预先感谢您对此的帮助!
尝试:
import pandas as pd
# STEP-1: load sample data and create a Series
data = {}
with open('SAMPLE.fasta') as fp:
for line in fp:
if line.startswith('>'):
id_ = line[1:].strip()
else:
data[id_] = line.strip()
sr = pd.Series(data)
# STEP-2: load the list of genome id and create a DataFrame
df = pd.read_table('data.tsv', header=None, names=['genomeID', 'contigIDs'])
df = df.assign(contigIDs=df['contigIDs'].str.split(',')).explode('contigIDs')
# STEP-3: map your series with your dataframe
df = df.assign(Seq=df['contigIDs'].map(sr)).dropna()
# STEP-4: create your files
for filename, df1 in df.groupby('genomeID'):
with open(f"{filename}.fasta", 'w') as fp:
for _, row in df1.iterrows():
fp.write(f">{row['contigIDs']}\n{row['Seq']}\n")
输出:
# Content of 424182.1.fasta
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
# Content of 1217675.1.fasta
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
假设序列数据以单行结尾(没有延伸到多行),awk 解决方案如何:
awk -F'\t' '
NR==FNR { # process SAMPLE.fasta file
if (FNR % 2) { # odd line with contigID
len = split([=10=], a, "|") # extract the contigID
id = a[len]
seq[id] = [=10=] # assign seq[id] to the line
} else { # even line with sequence
seq[id] = seq[id] RS [=10=] # append sequence to seq[id]
}
next
}
{ # process contigIDs file
fname = ".fasta" # filename to write
len = split(, a, ",") # split the contigIDs
for (i = 1; i <= len; i++) {
split(a[i], b, "|") # extract the contigID
if (b[3] in seq) { # if the sequence is found
print seq[b[3]] > fname # then print it to the file
}
}
close(fname)
}
' SAMPLE.fasta contigIDs
输出:
424182.1.fasta file:
>H|S1|C933685
GAAAGTTCTTGACCTGTGGACAGGCTGTGAATCGGGTTGGACAAGT
1217675.1.fasta file:
>H|S1|C85072
GGAAACGGCTGCTGCCATCCTTGCCCTTCGCCCAAG
>H|S1|C965427
CTCAAGAAATTCGGTATCACCGGTAACTATGAGGCAGTCGAGGTCG
我的一位同事使用 biopython and pandas
提供了一个解决方案,因此为了完整起见,我将其张贴在这里..
from Bio import SeqIO # require biopython>=1.7.9
import pandas as pd
genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None, names=['genomeID', 'contigID'])
for i, r in genomes_l.iterrows():
genome_name = r['genomeID']
contig_ids = r['contigID'].split(',')
genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in contig_ids]
with open(f'out_dir/{genome_name}.fasta', 'w') as handle:
SeqIO.write(genome_contigs, handle, 'fasta')
谢谢大家的帮助!