当 Python 有多个分隔符时,如何制作列表列表?
How to make a list of lists in Python when it has multiple separators?
示例文件如下所示(所有内容都在一行中,为便于阅读而换行):
['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
'>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
'>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
'>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
'$$$\n', '\n',
'>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
'>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
'>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
'>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
'>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
'>B42\n', 'TT-GTGGGTATC\n']
$$$
将两组分开。我需要使用 .strip
函数并删除 \n
和所有 "headers"。
我需要制作一个列表列表(如下所示)并将“-”替换为 Z(同样,所有内容都在一行上;为了便于阅读,此处换行):
[['TCCGGGGGTATC','TCCGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC',
'TCCGTGGGTATC',CGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC'],
['ATCGGGGGTATT', 'TT-GTGGGAATC','TTCGTGGGAATC', 'TT-GTGGGTATC',
'TTCGTGGGTATT', 'TTCGGGGGTATC','TT-GTGGGTATC', 'TTCGGGGGAATC',
'TTCGGGGGTATC', 'TTCGGGGGTATC','TT-GTGGGTATC]]
您可以利用 headers(和其他不需要的项目)的较小长度作为过滤掉它们的标准。您首先创建一个包含一个列表的列表,并将通过长度测试的项目附加 到内部列表。
当到达分隔符 '$$$'
时,一个新的子列表被添加到结果列表中,并且再次使用长度测试将剩余的项目添加到这个新的子列表:
lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n', '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n', '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n', '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n', '$$$\n', '\n', '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n', '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n', '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n', '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n', '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n','>B42\n', 'TT-GTGGGTATC\n']
result = [[]]
for x in lst:
if len(x) > 6:
result[-1].append(x.strip())
if x.startswith('$$$'):
result.append([])
print(result)
# [['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC'], ['ATCGGGGGTATT', 'TT-GTGGGAATC', 'TTCGTGGGAATC', 'TT-GTGGGTATC', 'TTCGTGGGTATT', 'TTCGGGGGTATC', 'TT-GTGGGTATC', 'TTCGGGGGAATC', 'TTCGGGGGTATC', 'TTCGGGGGTATC', 'TT-GTGGGTATC']]
这是 Moses Koledoye 的答案的变体,它检查 >
的第一个字符并丢弃所有匹配项和任何空元素。我还包括用 "Z".
替换“-”
lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
'>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
'>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
'>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
'$$$\n', '\n',
'>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
'>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
'>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
'>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
'>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
'>B42\n', 'TT-GTGGGTATC\n']
result = [[]]
for x in lst:
if x.startswith('>'):
continue
if x.startswith('$$$'):
result.append([])
continue
x = x.strip()
if x:
result[-1].append(x.replace("-", "Z"))
print(result)
这避免了对任何元素的长度赋予任何特殊意义。
示例文件如下所示(所有内容都在一行中,为便于阅读而换行):
['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
'>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
'>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
'>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
'$$$\n', '\n',
'>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
'>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
'>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
'>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
'>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
'>B42\n', 'TT-GTGGGTATC\n']
$$$
将两组分开。我需要使用 .strip
函数并删除 \n
和所有 "headers"。
我需要制作一个列表列表(如下所示)并将“-”替换为 Z(同样,所有内容都在一行上;为了便于阅读,此处换行):
[['TCCGGGGGTATC','TCCGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC',
'TCCGTGGGTATC',CGTGGGTATC','TCCGTGGGTATC', 'TCCGGGGGTATC'],
['ATCGGGGGTATT', 'TT-GTGGGAATC','TTCGTGGGAATC', 'TT-GTGGGTATC',
'TTCGTGGGTATT', 'TTCGGGGGTATC','TT-GTGGGTATC', 'TTCGGGGGAATC',
'TTCGGGGGTATC', 'TTCGGGGGTATC','TT-GTGGGTATC]]
您可以利用 headers(和其他不需要的项目)的较小长度作为过滤掉它们的标准。您首先创建一个包含一个列表的列表,并将通过长度测试的项目附加 到内部列表。
当到达分隔符 '$$$'
时,一个新的子列表被添加到结果列表中,并且再次使用长度测试将剩余的项目添加到这个新的子列表:
lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n', '>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n', '>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n', '>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n', '$$$\n', '\n', '>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n', '>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n', '>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n', '>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n', '>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n','>B42\n', 'TT-GTGGGTATC\n']
result = [[]]
for x in lst:
if len(x) > 6:
result[-1].append(x.strip())
if x.startswith('$$$'):
result.append([])
print(result)
# [['TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGTGGGTATC', 'TCCGGGGGTATC'], ['ATCGGGGGTATT', 'TT-GTGGGAATC', 'TTCGTGGGAATC', 'TT-GTGGGTATC', 'TTCGTGGGTATT', 'TTCGGGGGTATC', 'TT-GTGGGTATC', 'TTCGGGGGAATC', 'TTCGGGGGTATC', 'TTCGGGGGTATC', 'TT-GTGGGTATC']]
这是 Moses Koledoye 的答案的变体,它检查 >
的第一个字符并丢弃所有匹配项和任何空元素。我还包括用 "Z".
lst = ['>1\n', 'TCCGGGGGTATC\n', '>2\n', 'TCCGTGGGTATC\n',
'>3\n', 'TCCGTGGGTATC\n', '>4\n', 'TCCGGGGGTATC\n',
'>5\n', 'TCCGTGGGTATC\n', '>6\n', 'TCCGTGGGTATC\n',
'>7\n', 'TCCGTGGGTATC\n', '>8\n', 'TCCGGGGGTATC\n','\n',
'$$$\n', '\n',
'>B1\n', 'ATCGGGGGTATT\n', '>B2\n', 'TT-GTGGGAATC\n',
'>3\n', 'TTCGTGGGAATC\n', '>B4\n', 'TT-GTGGGTATC\n',
'>B5\n', 'TTCGTGGGTATT\n', '>B6\n','TTCGGGGGTATC\n',
'>B7\n', 'TT-GTGGGTATC\n', '>B8\n', 'TTCGGGGGAATC\n',
'>B9\n', 'TTCGGGGGTATC\n','>B10\n', 'TTCGGGGGTATC\n',
'>B42\n', 'TT-GTGGGTATC\n']
result = [[]]
for x in lst:
if x.startswith('>'):
continue
if x.startswith('$$$'):
result.append([])
continue
x = x.strip()
if x:
result[-1].append(x.replace("-", "Z"))
print(result)
这避免了对任何元素的长度赋予任何特殊意义。