Verify whether the previous line has the same string as the current line and sum the values of another column
What I want is a script that reads a file like this:
chr1,700244,714068,LOC100288069,982
chr1,1568158,1570027,MMP23A,784
chr1,1567559,1570030,MMP23A,784
chr1,1849028,1850740,TMEM52,799
chr1,2281852,2284100,LOC100129534,934
chr1,2281852,2284100,LOC100129534,800
chr1,2460183,2461684,HES5,819
chr1,2460183,2461684,HES5,850
chr1,2517898,2522908,FAM213B,834
chr1,2518188,2522908,FAM213B,834
chr1,2518188,2522908,FAM213B,834
chr1,2518188,2522908,FAM213B,834
chr1,2517898,2522908,FAM213B,834
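If column 3 (counting from 0) is repeated on consecutive lines, sum the values of column 4 and compute the average of that sum. The output should be: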
chr1,700244,714068,LOC100288069,982
chr1,1568158,1570027,MMP23A,784
chr1,1849028,1850740,TMEM52,799
chr1,2281852,2284100,LOC100129534,934
chr1,2460183,2461684,HES5,834.5
chr1,2517898,2522908,FAM213B,867
I tried this script, but it doesn't work. Can anyone give me some hints?
f1 = open('path', 'r')
reader1 = f1.read()
f3 = open('path/B_Media.txt','wb')
for line1 in f1:
coluna = line1.split(',')
chr = coluna[0]
start = coluna[1]
end = coluna[2]
gene = coluna[3]
valor_B = coluna[4]
previous_line = current_line
current_line = line
gene2 = previous_line[3]
soma_B2 = previous_line[4]
soma_de_B = int(valor_B)+int(soma_B2)
if gene == gene2:
x += 1
media_gene = soma_de_B/x
output = chr + "," + start + "," + end + "," + gene + "," +valor_B+","+media_gene
f3.write(output)
f3.flush()
print output
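Because you need to know what comes next (when reading line by line), I would split reading and writing into two separate parts.
Also, the csv module may come in handy, because you don't have to handle special cases yourself (such as commas inside the text), and reading and writing become very simple. Opening files with a with statement is generally good practice, because closing the file is handled automatically.
Now for some code :-)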
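# Python 2 code; on Python 3, open the output with
# open('B_Media.txt', 'w', newline='') instead of mode 'wb'.
from __future__ import division
import csv

gene = 3      # index of the gene-name column
valor_B = 4   # index of the value column to average

# Read the whole input into memory so each row can be compared with the next.
data = []
with open('data.csv', 'r') as readfile:
    reader = csv.reader(readfile)
    for row in reader:
        data.append(row)

values_to_add = []
with open('B_Media.txt', 'wb') as writefile:
    writer = csv.writer(writefile)
    for i in range(len(data)):
        values_to_add.append(int(data[i][valor_B]))
        # if this is the last row or the next row has a different gene, write it
        if i == len(data) - 1 or data[i][gene] != data[i + 1][gene]:
            data[i][valor_B] = sum(values_to_add) / len(values_to_add)
            writer.writerow(data[i])
            values_to_add = []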
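Basically, it first reads everything from the input file into data. Then, with the output file open, it goes through each row and does the following:
- Append the row's column 4 value to the list of values that will eventually be written (maybe not now, but eventually).
- If the current row is the last row, or its gene differs from the next row's (we need to catch the last row too!), write to the output. When we do, we take the average of the values collected so far (at least 1, possibly 2 or more) using sum()/len(), replace the corresponding column with that average, write the row to the output file, and reset the list.
- Otherwise, do nothing! The column 4 value was already added to the list in the first step, so we can simply move on to the next row.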
Result:
chr1,700244,714068,LOC100288069,982.0
chr1,1567559,1570030,MMP23A,784.0
chr1,1849028,1850740,TMEM52,799.0
chr1,2281852,2284100,LOC100129534,867.0
chr1,2460183,2461684,HES5,834.5
chr1,2517898,2522908,FAM213B,834.0
(You may recognize the from __future__ import division statement; it ensures that division can yield non-integer values, such as 834.5.)
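As an alternative, here is a minimal sketch of the same run-averaging idea for Python 3, using itertools.groupby so that the rows do not all have to be loaded first. It is not part of the answer above: the function name average_consecutive and the in-memory sample are assumptions made for illustration, and a real script would pass csv.reader(open(...)) instead of the StringIO object.

```python
import csv
import io
from itertools import groupby

GENE = 3      # index of the gene-name column
VALOR_B = 4   # index of the value column to average


def average_consecutive(rows):
    """Yield one row per run of consecutive rows sharing the same gene.

    Hypothetical helper for illustration; not from the original answer.
    """
    for _, run in groupby(rows, key=lambda row: row[GENE]):
        run = list(run)
        values = [int(row[VALOR_B]) for row in run]
        # Keep the last row of the run, as the index-based loop does.
        merged = list(run[-1])
        merged[VALOR_B] = sum(values) / len(values)
        yield merged


# Small in-memory sample taken from the question's input.
sample = io.StringIO(
    "chr1,1849028,1850740,TMEM52,799\n"
    "chr1,2460183,2461684,HES5,819\n"
    "chr1,2460183,2461684,HES5,850\n"
)
result = list(average_consecutive(csv.reader(sample)))

# Write the averaged rows back out as CSV text.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(result)
print(out.getvalue(), end="")
```

groupby only merges adjacent rows with equal keys, which matches the "repeated on consecutive lines" requirement; if the same gene could reappear later in the file, each run would still be averaged separately.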