python 遍历文件时枚举超出范围
python enumerate out of range when looping through a file
我有一个名为 test.txt
的路径文件
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
请注意,行数是偶数且始终是偶数,我的最终目标是解析此文件并创建一个新文件,以两两方式循环遍历这些路径。我正在尝试 enumerate
函数,但这不会两个两个地解析。此外,我超出了范围,因为索引我正在做的方式是错误的。如果有人可以告诉我如何使用 enumerate
.
正确索引,那也很棒
with open('./src/test.txt') as f:
for index,line in enumerate(f):
sample = re.search(r'pfg[\dGT]+',line)
sample_string = sample.group(0)
#print(sample_string)
print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string,line,line[index+1]))
结果是这样的:
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c","library":"pfg002T"},
显然索引是错误的,因为它遍历了我路径中的每个元素,即 g
r
等,而不是打印下一个路径。对于第一次迭代,打印的下一个路径应该是:"fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz"
.
我相信问题本身可以用 itertools
更优雅地解决,我只是不知道该怎么做。如果有人可以告诉我枚举索引是否也可以工作,那也很棒。
一个问题是您在阅读数据对之前试图从第二行访问数据。此外,您不能使用 line[index + 1]
访问第二行,因为它指的是当前行中的一个字符,而不是尚未读取的下一行。
因此您需要跟踪成对的行。您可以使用 enumerate()
提供的索引来确定当前行是第一行(因为它是偶数)还是第二行(因为它是奇数)。当您阅读第一行时,存储 fastq_1 的名称和路径。只在第二行写输出。像这样:
重新导入
with open('test.txt') as f:
for index, line in enumerate(f):
if index % 2 == 0: # even, so this is the first line of a pair
name = re.search(r'pfg[\dGT]+',line).group(0)
fastq_1 = line.rstrip()
else: # odd, so second line. Emit result
fastq_2 = line.rstrip()
print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name, fastq_1, fastq_2))
line.rstrip()
需要删除每行末尾的尾随换行符。
您可能需要考虑,当您将索引分配给变量时,您得到的是该字符串的索引字符而不是它的索引。
你可以做的是将文件分配给一个列表,然后获取索引位置,这样你就可以根据需要在行之间切换。
还是不明白点,你是想在fastq_1
和fastq_2
两行之间切换还是每条路径都根据它的key?
代码语法
with open(path) as f:
lis = list(f)
for index, line in enumerate(lis):
try:
sample = re.search(r'pfg[\dGT]+',line)
sample_string = sample.group(0)
print(f'{{"name":"{sample_string}","readgroup":"{sample_string}","platform_unit":"{sample_string}","fastq_1":"{line}","fastq_2":"{lis[index+1]}","library":"{sample_string}"}},')
except IndexError:
break
输出
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Ta
rgeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
","library":"pfg002T"},
[Program finished]
@mhawke 已经提供了一个很好的解决方案,但要给出另一种方法,“循环遍历这些......在两个基础上”可以使用 Python 中的 more_itertools.chunked
function from the more_itertools library or with the grouper()
recipe 来完成手册。
这还提供了当最后一行是奇数行时应该发生什么的选项;是否应该引发错误或将其与默认值配对。
我有一个名为 test.txt
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
请注意,行数是偶数且始终是偶数,我的最终目标是解析此文件并创建一个新文件,以两两方式循环遍历这些路径。我正在尝试 enumerate
函数,但这不会两个两个地解析。此外,我超出了范围,因为索引我正在做的方式是错误的。如果有人可以告诉我如何使用 enumerate
.
with open('./src/test.txt') as f:
for index,line in enumerate(f):
sample = re.search(r'pfg[\dGT]+',line)
sample_string = sample.group(0)
#print(sample_string)
print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string,line,line[index+1]))
结果是这样的:
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c","library":"pfg002T"},
显然索引是错误的,因为它遍历了我路径中的每个元素,即 g
r
等,而不是打印下一个路径。对于第一次迭代,打印的下一个路径应该是:"fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz"
.
我相信问题本身可以用 itertools
更优雅地解决,我只是不知道该怎么做。如果有人可以告诉我枚举索引是否也可以工作,那也很棒。
一个问题是您在阅读数据对之前试图从第二行访问数据。此外,您不能使用 line[index + 1]
访问第二行,因为它指的是当前行中的一个字符,而不是尚未读取的下一行。
因此您需要跟踪成对的行。您可以使用 enumerate()
提供的索引来确定当前行是第一行(因为它是偶数)还是第二行(因为它是奇数)。当您阅读第一行时,存储 fastq_1 的名称和路径。只在第二行写输出。像这样:
重新导入
with open('test.txt') as f:
for index, line in enumerate(f):
if index % 2 == 0: # even, so this is the first line of a pair
name = re.search(r'pfg[\dGT]+',line).group(0)
fastq_1 = line.rstrip()
else: # odd, so second line. Emit result
fastq_2 = line.rstrip()
print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name, fastq_1, fastq_2))
line.rstrip()
需要删除每行末尾的尾随换行符。
您可能需要考虑,当您将索引分配给变量时,您得到的是该字符串的索引字符而不是它的索引。
你可以做的是将文件分配给一个列表,然后获取索引位置,这样你就可以根据需要在行之间切换。
还是不明白点,你是想在fastq_1
和fastq_2
两行之间切换还是每条路径都根据它的key?
代码语法
with open(path) as f:
lis = list(f)
for index, line in enumerate(lis):
try:
sample = re.search(r'pfg[\dGT]+',line)
sample_string = sample.group(0)
print(f'{{"name":"{sample_string}","readgroup":"{sample_string}","platform_unit":"{sample_string}","fastq_1":"{line}","fastq_2":"{lis[index+1]}","library":"{sample_string}"}},')
except IndexError:
break
输出
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Ta
rgeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","library":"pfg001G"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","library":"pfg001T"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","library":"pfg002G"},
{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
","library":"pfg002T"},
[Program finished]
@mhawke 已经提供了一个很好的解决方案,但要给出另一种方法,“循环遍历这些......在两个基础上”可以使用 Python 中的 more_itertools.chunked
function from the more_itertools library or with the grouper()
recipe 来完成手册。
这还提供了当最后一行是奇数行时应该发生什么的选项;是否应该引发错误或将其与默认值配对。