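How to fill in missing sequence lines in a TSV file
I'm still a beginner, so apologies if this question has an obvious answer, and apologies for the messy code, but my files have tens of thousands of lines. I'm using a sliding-window technique to move along each file, so I need to make sure every window is present. However, some of my input files are missing certain lines, so I tried to write Python code that adds those lines, with the information I want, to make the files complete. The code looks like this: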
#!/usr/bin/env python
outfile = open ("missing_test.txt", "w")
with open("add_missing.txt", "r") as file:
    last_line = 0 #This is where it starts for bin 1
    lines = []
    header_line = next(file)
    outfile.write(header_line)
    CHROM = 'BABA_1'
    for line in file: #go through every line to check its existence and rewrite to new file
        nums = line.split("\t")
        num1 = nums[0] #no integer because this is a string: name of the individual
        num2 = int(nums[1]) #integer for window
        num3 = int(nums[2]) #integer for coverage (here always 10000 to meet the threshold)
        num4 = int(nums[3]) #integer for SNP count
        if num1 == CHROM:
            while num2 != last_line + 10000:
                #A line is missing, so a new line is added with 0 SNPs:
                NUM2 = last_line + 10000 # New window, the one that was missing
                NUM4 = 0 #0 SNPs found
                #lines.append((num1, NUM2, num3, NUM4))
                OUTLINE = "%s\t%s\t%s\t%s" % (num1, NUM2, num3, NUM4) #write new line to outfile
                outfile.write(OUTLINE + "\n")
                last_line += 10000
            lines.append((num1,num2,num3,num4))
            last_line += 10000 #also add 10000 here otherwise the while loop makes no sense
            outline = "%s\t%s\t%s\t%s" % (num1, num2, num3, num4)
            outfile.write(outline + "\n") #write all existing lines to outfile
        else:
            CHROM = num1
            last_line = 0
outfile.close()
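So this works fine as long as the first window of the first "CHROM" equals 0, but that is not always the case. In the latter case, the loop becomes infinite. Here is what an example input and the DESIRED output look like:
Input: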
indiv window coverage SNP
BABA_1 20000 10000 7
BABA_1 30000 10000 1
BABA_1 50000 10000 2
BABA_1 60000 10000 3
BABA_1 80000 10000 1
BABA_10 20000 10000 1
BABA_10 30000 10000 16
BABA_10 80000 10000 9
Desired output:
indiv window coverage SNP
BABA_1 10000 10000 0
BABA_1 20000 10000 7
BABA_1 30000 10000 1
BABA_1 40000 10000 0
BABA_1 50000 10000 2
BABA_1 60000 10000 3
BABA_1 70000 10000 0
BABA_1 80000 10000 1
BABA_10 10000 10000 0
BABA_10 20000 10000 1
BABA_10 30000 10000 16
BABA_10 40000 10000 0
BABA_10 50000 10000 0
BABA_10 60000 10000 0
BABA_10 70000 10000 0
BABA_10 80000 10000 9
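I've been trying hard to find a way to make my while loop work instead of running forever, but I really don't see my mistake. Can anyone tell me how to fix this?
Any help is much appreciated, thanks in advance!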
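For what it's worth, one likely reason the loop never ends: `while num2 != last_line + 10000` only stops when the two values are exactly equal. If the current window is already below the expected value (a window of 0, say, or rows that are out of order), adding 10000 only moves further away. The sketch below is just an illustration, not part of the original script; `fill_steps`, `use_not_equal` and `max_steps` are made-up names, and the step cap exists only so the demonstration itself terminates.

```python
def fill_steps(window, last_line, step=10000, use_not_equal=True, max_steps=100):
    """Count how many filler rows a fill loop would emit before it stops.

    use_not_equal=True mimics `while window != last_line + step`;
    use_not_equal=False mimics `while window > last_line + step`.
    Returns None if the loop is still going after max_steps iterations,
    which stands in for "never terminates".
    """
    steps = 0
    while True:
        expected = last_line + step
        if use_not_equal:
            keep_going = window != expected
        else:
            keep_going = window > expected
        if not keep_going:
            return steps
        if steps == max_steps:
            return None  # gave up: the real loop would spin forever
        last_line += step
        steps += 1


if __name__ == "__main__":
    print(fill_steps(30000, 0))                    # two windows are missing
    print(fill_steps(0, 0))                        # `!=` can never be satisfied
    print(fill_steps(0, 0, use_not_equal=False))   # `>` stops immediately
```

With a `>` (or `<`) comparison the loop is bounded by the distance to the current window, so it cannot overshoot.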
Try the following:
#!/usr/bin/python
outfile = open ("missing_test.txt", "w")

def write_line(indiv, window, coverage, snp):
    outline = "%s\t%s\t%s\t%s\n" % (indiv, window, coverage, snp)
    outfile.write(outline)

with open("add_missing.txt", "r") as file:
    lines = file.readlines()

# Copy the header row, then start tracking from the first data row.
write_line(*lines.pop(0).rstrip().split("\t"))
first_line = lines[0].split("\t")
last_indiv = first_line[0]
last_window = int(first_line[1])

for line in lines:
    indiv, window, coverage, snp = line.split("\t")
    window = int(window)
    coverage = int(coverage)
    snp = int(snp)
    if indiv == last_indiv:
        # If the current window is higher than expected,
        # insert a line with the missing window.
        # Repeat until we get to the expected window.
        while window > last_window + 10000:
            write_line(indiv, last_window + 10000, coverage, 0)
            last_window += 10000
        last_window = window
    else:
        last_indiv = indiv
        last_window = window
    write_line(indiv, window, coverage, snp)

outfile.close()
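It does not handle the expectation that a particular window number should be the first one for a given indiv, because you did not define that behavior and your comments about it are rather confusing.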
Contents of missing_test.txt after running this script:
indiv window coverage SNP
BABA_1 20000 10000 7
BABA_1 30000 10000 1
BABA_1 40000 10000 0
BABA_1 50000 10000 2
BABA_1 60000 10000 3
BABA_1 70000 10000 0
BABA_1 80000 10000 1
BABA_10 20000 10000 1
BABA_10 30000 10000 16
BABA_10 40000 10000 0
BABA_10 50000 10000 0
BABA_10 60000 10000 0
BABA_10 70000 10000 0
BABA_10 80000 10000 9
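The output above still lacks the 10000 row for each indiv, which the desired output in the question does include. A minimal sketch of one way to cover that, assuming every indiv should start at window 10000 and that rows are sorted by indiv and then by window; `fill_missing` and its `step` and `first_window` parameters are names invented for this illustration, not something from the question:

```python
def fill_missing(rows, step=10000, first_window=10000):
    """Return rows with a zero-SNP row added for every absent window.

    rows: iterable of (indiv, window, coverage, snp) tuples with integer
    window/coverage/snp, sorted by indiv and then by window.
    Assumption: each indiv is expected to start at first_window.
    """
    filled = []
    last_indiv = None
    expected = first_window
    for indiv, window, coverage, snp in rows:
        if indiv != last_indiv:
            # New individual: start counting from the first window again.
            last_indiv = indiv
            expected = first_window
        # `<` guarantees termination even if a window is lower than expected.
        while expected < window:
            filled.append((indiv, expected, coverage, 0))
            expected += step
        filled.append((indiv, window, coverage, snp))
        expected = window + step
    return filled


if __name__ == "__main__":
    sample = [
        ("BABA_1", 20000, 10000, 7),
        ("BABA_1", 30000, 10000, 1),
        ("BABA_1", 50000, 10000, 2),
        ("BABA_10", 20000, 10000, 1),
        ("BABA_10", 40000, 10000, 9),
    ]
    for row in fill_missing(sample):
        print("\t".join(str(field) for field in row))
```

To use it on a file, you could parse each data line into such a tuple (for example with `csv.reader` and `delimiter='\t'`) and write the returned rows back out the same way.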
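You can use the following approach, which first builds a list of empty entries, then assigns any existing entries to it, before writing them out as rows: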
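import csv
import itertools

# Python 3: open in text mode with newline='' so the csv module handles line endings.
with open('add_missing.txt', newline='') as f_input, open('missing_test.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter='\t', skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(next(csv_input))  # copy the header row

    for k, g in itertools.groupby(csv_input, lambda x: x[0]):
        # One zero-SNP placeholder per window, 10000 through 80000 (8 windows are assumed).
        empty = [[k, x * 10000, 10000, 0] for x in range(1, 9)]
        for row in g:
            # Overwrite the placeholder with the real row; // keeps the index an integer.
            empty[int(row[1]) // 10000 - 1] = row
        csv_output.writerows(empty)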
This gives you:
indiv window coverage SNP
BABA_1 10000 10000 0
BABA_1 20000 10000 7
BABA_1 30000 10000 1
BABA_1 40000 10000 0
BABA_1 50000 10000 2
BABA_1 60000 10000 3
BABA_1 70000 10000 0
BABA_1 80000 10000 1
BABA_10 10000 10000 0
BABA_10 20000 10000 1
BABA_10 30000 10000 16
BABA_10 40000 10000 0
BABA_10 50000 10000 0
BABA_10 60000 10000 0
BABA_10 70000 10000 0
BABA_10 80000 10000 9
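The approach above hardcodes eight windows (10000 through 80000), so a window beyond 80000 would raise an IndexError. As a rough sketch of how that could be relaxed, the version below takes the largest window found anywhere in the data as the upper bound for every indiv. That choice is an assumption, not something stated in the question, and `pad_groups`, `step` and `coverage` are names made up for this example. It also assumes every window is a positive multiple of `step`; rows that are off that grid would not be kept.

```python
import itertools


def pad_groups(rows, step=10000, coverage=10000):
    """Pad every indiv to the same set of windows: step, 2*step, ..., max window.

    rows: iterable of (indiv, window, coverage, snp) as strings or ints,
    grouped by indiv (as itertools.groupby requires).
    Assumption: the largest window in the data is the upper bound for all.
    """
    parsed = [(indiv, int(window), int(cov), int(snp))
              for indiv, window, cov, snp in rows]
    if not parsed:
        return []
    max_window = max(row[1] for row in parsed)

    padded = []
    for indiv, group in itertools.groupby(parsed, key=lambda row: row[0]):
        seen = {row[1]: row for row in group}  # window -> existing row
        for window in range(step, max_window + step, step):
            # Keep the real row when present, otherwise add a zero-SNP row.
            padded.append(seen.get(window, (indiv, window, coverage, 0)))
    return padded


if __name__ == "__main__":
    sample = [
        ("BABA_1", "20000", "10000", "7"),
        ("BABA_1", "30000", "10000", "1"),
        ("BABA_10", "50000", "10000", "9"),
    ]
    for row in pad_groups(sample):
        print("\t".join(str(field) for field in row))
```

The rows from `csv.reader` can be passed in directly after the header has been consumed with `next()`, and the result can go to `csv.writer(...).writerows()`.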