Searching and recording from a text file in Python
I am looking for some advice on the search script below. Any help would be great.
The next line is an example from my input (query) file ("out.list.txt"):
IVVTGPHKFNRCPLKKLAQSFTMPTSTFVDI*GLNFDITEQHFVKEKP**SSEEAQFFAK
I can find this line, and 50,000 others, in the alignment file ("out.test.txt") and print the output.
Here is an extract from the alignment file:
Query_13 388 IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG 567
c18644_g2_i1_3 122 LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG 181
c18644_g1_i1_2 121 LVVGATAEAPIAGNKADVTLAAYNAIQAALRLIKPGSTNTEVTQVFNKIAADYHCNVLEG 180
c11476_g1_i1_2 119 VVVQ-DPSAKVTGEKADLLLAALNAMQAALRLVRPGNTNTQVTEAMSKIAEAYGCTMLEG 177
c7710_g1_i1_1 147 IVVSEKADAVVEGRKADVVHAAYNALQVALRLLKPGQKNNDVTEHIAKVVESYKCNPVEG 206
c37_g1_i1_3 145 VVVGKDKSTGAEGRKAEVILAAYNALQASLRHLRPGSKNYDVTETVEKISETFGCNPVEG 204
c2897_g1_i1_3 144 FILGATAENPASGKKADVILAAKQAIDAAVRKIRVGETNLTLTETIARVAAAYGVNSVEG 203
c4999_g1_i1_2 167 VVI---GKEKVDDKRADVVKCAWDAAEAALRLVQVGNTNTQVTEAFTKIADEYGCKPMQG 223
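If the query line contains a "*", is it possible to record what is at that position on the other lines of the output? i.e. E, E, Q, D, D, T, V
All my attempts so far have been unsuccessful, and I am wondering whether what I am attempting is feasible.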
seq_list = open("out.list.txt")
query_sequences = []
for sequence in seq_list:
    # strip the trailing newline from each line (not from the file object)
    query_sequences.append(sequence.strip())
seq_list.close()
hits = []
alignments = open("out.test.txt")
for line in alignments:
    alignment_hit = line.split()
    for query_sequence in query_sequences:
        if query_sequence in alignment_hit:
            hits.append(line)
            break
alignments.close()
with open("out.list.txt") as f:
    sequence = f.read().strip()  # the query as a string, without the trailing newline
with open("out.test.txt") as f:
    alignment_rows = f.readlines()  # the alignment file as a list of lines
# split each row on whitespace and extract the sequences only (third column)
# I assume the columns in your alignment are separated by tabs or spaces
alignment_sequences = [row.split()[2] for row in alignment_rows if row.strip()]
# a dict whose keys are the indices of positions with * and whose values are lists,
# e.g. {1: ['A', 'C'], 2: ['D', 'E']}
output = {}
for index, char in enumerate(sequence):
    if char == "*":
        output[index] = []
        for alignment_sequence in alignment_sequences:
            output[index].append(alignment_sequence[index])
If you only want the characters from the aligned sequences, try this (it also handles multiple * per line):
lines = [line.rstrip() for line in open('out.test.txt')]
star_indicies = []  # positions of "*" in the most recent query row
for line in lines:
    data = line.split()
    if len(data) < 3:
        continue  # skip blank or incomplete lines
    sequence = data[2]
    if data[0].startswith("Query"):
        star_indicies = [i for i, c in enumerate(sequence) if c == '*']
    else:
        print(list(sequence[star_index] for star_index in star_indicies))
Output for the sample input:
['E']
['E']
['Q']
['D']
['D']
['T']
['Q']
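If you would rather have one comma-separated record per "*" position (the E,E,Q,... form asked for in the question), the same idea can be wrapped in a function. This is only a minimal sketch: the function name `residues_at_stars` is made up for illustration, and it assumes whitespace-separated rows of the form id, start, sequence, end, with each row whose id starts with "Query" followed by its hit rows. The sample rows are taken from the extract in the question.

```python
from collections import defaultdict


def residues_at_stars(lines):
    """Map (query_id, column) -> residues found in that column of the hit rows.

    Hypothetical helper: assumes each query row (id starting with "Query")
    is followed by the hit rows that were aligned against it.
    """
    result = defaultdict(list)
    query_id = None
    star_indices = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        name, sequence = fields[0], fields[2]
        if name.startswith("Query"):
            # remember where the "*" characters are in this query row
            query_id = name
            star_indices = [i for i, c in enumerate(sequence) if c == "*"]
        else:
            for i in star_indices:
                if i < len(sequence):  # guard against rows shorter than the query
                    result[(query_id, i)].append(sequence[i])
    return dict(result)


sample = [
    "Query_13 388 IVVQADGSQVVEDRKADVMNAAYNALQAGLRTIKVGNTNT*VTEVMNKAIEPFECNMLEG 567",
    "c18644_g2_i1_3 122 LVVGASAETPITGNKADVVLAAYNAIQAALRLIKPGNSNLEVTEVFNKIATDYQCNVLEG 181",
    "c18644_g1_i1_2 121 LVVGATAEAPIAGNKADVTLAAYNAIQAALRLIKPGSTNTEVTQVFNKIAADYHCNVLEG 180",
    "c11476_g1_i1_2 119 VVVQ-DPSAKVTGEKADLLLAALNAMQAALRLVRPGNTNTQVTEAMSKIAEAYGCTMLEG 177",
    "c7710_g1_i1_1 147 IVVSEKADAVVEGRKADVVHAAYNALQVALRLLKPGQKNNDVTEHIAKVVESYKCNPVEG 206",
    "c37_g1_i1_3 145 VVVGKDKSTGAEGRKAEVILAAYNALQASLRHLRPGSKNYDVTETVEKISETFGCNPVEG 204",
    "c2897_g1_i1_3 144 FILGATAENPASGKKADVILAAKQAIDAAVRKIRVGETNLTLTETIARVAAAYGVNSVEG 203",
    "c4999_g1_i1_2 167 VVI---GKEKVDDKRADVVKCAWDAAEAALRLVQVGNTNTQVTEAFTKIADEYGCKPMQG 223",
]

for (query_id, column), residues in residues_at_stars(sample).items():
    print(query_id, column, ",".join(residues))
# prints: Query_13 40 E,E,Q,D,D,T,Q
```

To run it on the real file, pass the open file object instead of the list, e.g. `with open("out.test.txt") as f: residues_at_stars(f)`. Keying the result on both the query id and the column keeps the records apart when the file contains many query blocks.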