如何将分数分配给数据点列表,然后输出值与 python 中的平均值 > 2 个标准差?

How can I assign scores to a list of datapoints and then output values > 2 standard deviations from the mean in python?

我编写了一个脚本来读取包含多行数据的文本文件。一行数据示例:

10      1100    1101    G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       G       A/G     G       G       G       A/G     G       G       A/G     A/G     G       G       A/G     G       A/G     G       G       G       G       G       .       G       A/G     A/G     G       G       A/G     G       G       G       A/G     G       A       A/G     A/G     A/G     G       G       A/G     G       G       G       A/G     .       G       A       A/G     A       .       A       A       A       G       G       G       A       G       A/G     A/G     A/G     A       G       A/G     A       A       A       A       A       A       A/G     A

我的脚本根据不同字母的相对数量计算每行数据的频率分数作为百分比。目前我的脚本输出行的子集和百分比分数,如果百分比分数 > 0.75。 但是我想让脚本做一些更复杂的事情,但我不知道该怎么做。 1) 对于每一行数据,我希望脚本将第 1、2、4 和 5 列数据保存在一个数组中,并将百分比分数作为附加值添加。 2)然后,一旦脚本读取了文本文件中的所有行,我希望它输出百分比分数高于平均百分比分数> 2个标准差的所有行。

在下面找到我当前的脚本。

(一件小但非必要的额外事情,目前我将每个相关行打印两次,因为如果一个字母的百分比分数>0.75,另一个字母的百分比分数也总是>0.75。为了解决这个问题,我只是必须让脚本在打印一次后继续到下一行数据,但是如果我应该使用 break、continue 或其他东西让脚本继续到下一行而不结束整个,我总是感到困惑脚本。)

inputfile = open('datafile.txt', 'r')
output = open('output.txt', 'w')



#windowstart = 0
for line in inputfile:

    line = line.rstrip() 
    fields = line.split("\t")
    chrom = fields[0]
    pos = str(fields[1])
    allele_one = str(fields[3])
    allele_two = str(fields[4])

#which columns belong to which population   
    PopulationA = fields[3:26]
    PopulationB = fields[26:36]

#sample size of each population 
    PopulationA_popsize = 46
    PopulationB_popsize = 20


#Now count the total number of alleles in each population (Homozygous alleles counted twice, heterozygotes just once)   

#count C allele
    C_count_PopulationA = (2*PopulationA.count("C")) + PopulationA.count("C/T") + PopulationA.count("A/C") + PopulationA.count("C/G")
    percentage_C_PopulationA = float(C_count_PopulationA)/46

#count A allele

    A_count_PopulationA = (2*PopulationA.count("A")) + PopulationA.count("A/T") + PopulationA.count("A/C") + PopulationA.count("A/G")
    percentage_A_PopulationA = float(A_count_PopulationA)/46

#count T allele

    T_count_PopulationA = (2*PopulationA.count("T")) + PopulationA.count("C/T") + PopulationA.count("A/T") + PopulationA.count("G/T")
    percentage_T_PopulationA = float(T_count_PopulationA)/46

#count G allele

    G_count_PopulationA = (2*PopulationA.count("G")) + PopulationA.count("G/T") + PopulationA.count("A/G") + PopulationA.count("C/G")
    percentage_G_PopulationA = float(G_count_PopulationA)/46

#count missing data 
    null_count_PopulationA = (2*PopulationA.count("."))
    percentage_null_PopulationA = float(null_count_PopulationA)/46

#repeat for population B

    C_count_PopulationB = (2*PopulationB.count("C")) + PopulationB.count("C/T") + PopulationB.count("A/C") + PopulationB.count("C/G")
    percentage_C_PopulationB = float(C_count_PopulationB)/20

    A_count_PopulationB = (2*PopulationB.count("A")) + PopulationB.count("A/T") + PopulationB.count("A/C") + PopulationB.count("A/G")
    percentage_A_PopulationB = float(A_count_PopulationB)/20

    T_count_PopulationB = (2*PopulationB.count("T")) + PopulationB.count("C/T") + PopulationB.count("A/T") + PopulationB.count("G/T")
    percentage_T_PopulationB = float(T_count_PopulationB)/20

    G_count_PopulationB = (2*PopulationB.count("G")) + PopulationB.count("G/T") + PopulationB.count("A/G") + PopulationB.count("C/G")
    percentage_G_PopulationB = float(G_count_PopulationB)/20

    null_count_PopulationB = (2*PopulationB.count("."))
    percentage_null_PopulationB = float(null_count_PopulationB)/20



#If missing data less than 10% in both populations  
    if percentage_null_PopulationA < 0.1:
        if percentage_null_PopulationB < 0.1:
#calculate frequency difference between populations for each allele         
            Frequency_diff_C_PopulationA_PopulationB = float(abs(percentage_C_PopulationA - percentage_C_PopulationB))
            Frequency_diff_A_PopulationA_PopulationB = float(abs(percentage_A_PopulationA - percentage_A_PopulationB))
            Frequency_diff_T_PopulationA_PopulationB = float(abs(percentage_T_PopulationA - percentage_T_PopulationB))
            Frequency_diff_G_PopulationA_PopulationB = float(abs(percentage_G_PopulationA - percentage_G_PopulationB))
#if the frequency difference between alleles is greater than 0.75, print part of the row            
            if Frequency_diff_C_PopulationA_PopulationB >= 0.75:

                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)

            if Frequency_diff_A_PopulationA_PopulationB >= 0.75:

                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)

            if Frequency_diff_T_PopulationA_PopulationB >= 0.75:

                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)

            if Frequency_diff_G_PopulationA_PopulationB >= 0.75:

                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)

我正在寻找每一行人群之间的等位基因频率差异。因此,例如,如果我们假设群体 A 中有 10 个个体(前 10 列核苷酸),群体 B 中有 10 个个体(最后 10 列核苷酸),那么在下面的示例数据行中,我们看到群体 A 有 10 个 G 核苷酸。种群 B 有 3 个 G 核苷酸和 7 个 A 核苷酸。因此 2 个种群之间的频率差异为 70%。

10      20    21    G       G       G       G       G       G       G       G       G       G      G       G       G       A       A       A       A      A       A       A   

一旦谈到大量数据的均值和标准差,您就应该开始使用任何数值库。考虑在这里使用 numpy,甚至 pandas(为了可读性)。我将在这个例子中使用它们,连同 Counter object from the collections module。阅读两者以了解它们是如何工作的,但我也会在整个代码中进行一些解释。

import numpy as np
from collections import Counter    

nucleotid_bases = ('C', 'A', 'T', 'G', '.')
results = []
checksum = []
with open('datafile.txt') as f:
    for line in f:
        fields = line.split()  # splits by consecutive whitespace, empty records will be purged
        chrom, pos = [int(fields[x]) for x in (0,1)]
        results.append([chrom,pos])  # start by building the current record
        allele1, allele2 = [fields[i] for i in (3,4)]
        checksum.append([allele1, allele2])  # you wanted to keep these, most likely for debugging purposes?
        popA = fields[3:26]  # population size: 2*23
        popB = fields[26:36]  # population size: 2*10
        for population in (popA, popB):
            summary = Counter(population) # traverses the line only once - much more efficient!
            base_counts = [ sum(summary[k] for k in summary.keys() if base in k) for base in nucleotid_bases]
            for index, base_name in enumerate(nucleotid_bases):
                # Double the count when there is an exact match, e.g. "A/A" -> "A"
                # An 'in' match can match an item anywhere in the string: 'A' in 'A/C' evaluates to True
                base_counts[index] += summary[base_name]    
            results[-1].extend(base_counts)  # append to the current record
results = np.array(results, dtype=np.float)  # shape is now (x, 12) with x the amount of lines read
results[:, 2:7] /= 46
results[:, 7:] /= 20

此时,results的布局是两列填充了chromresults[:,0])和posresults[:,1])标签从文本文件中, 然后是人口 A 的 5 列,其中这 5 列中的第一列包含 'C' 基数的相对频率,接下来 'A' 基础的列等等(参见 nucleotid_bases 的顺序)。然后,最后 5 列类似,但它们是针对人口 B:

chrom, pos, freqC_in_A,..., freqG_in_A, freq_dot_in_A freqC_in_B, ..., freqG_in_B, freq_dot_in_B

如果您想忽略此 table 中任一未知频率(第 6 列和第 11 列)高于阈值的记录(行),您可以这样做:

threshold = .1 # arbitrary: 10%
to_consider = np.logical_and(results[:,6] < threshold, results[:,11] < threshold)
table = results[to_consider][:, [0,1,2,3,4,5,7,8,9,10]]

现在您可以计算 table 频率差异:

freq_diffs  = np.abs(table[:,2:6] - table[:,-4:])  # 4 columns, n rows

mean_freq_diff = freq_diffs.mean(axis=0) # holds 4 numbers, these are the means over all the rows
std_freq_diff = freq_diffs.std(axis=0) # similar: std over all the rows

condition = freq_diffs > (mean_freq_diff + 2*std_freq_diff)

现在您需要检查 condition 是否对该行的任何元素有效,例如如果 popA 和 popB 之间 'C' 的频率差是 .8,而 (mean+2*std) 是 .7,那么它将 return 为真。但它也会 return True 对于同一行,如果其他任何一个都满足此条件 核苷酸。要检查任何核苷酸频率差异的条件是否为真,请执行以下操作:

specials = np.any(condition, axis=1)  
print(table[specials, :2])