在 EggLib 中指示种群结构 Python

Indicating a population structure in EggLib Python

在Python中,我正在使用EggLib。我正在尝试计算在 VCF 文件中找到的每个 SNP 的 Jost 的 D 值。

数据

数据为 here VCF 格式。数据集很小,有2个种群,每个种群100个个体,6个SNP(都在1号染色体上)。

每个个体都被命名为Pp.Ii,其中p是它所属的种群指数,i是个体指数。

代码

我的困难在于人口结构的规范。这是我的试用版

### Read the vcf file ###
vcf = egglib.io.VcfParser("MyData.vcf") 

### Create the `Structure` object ###
# Dictionary for a given cluster. There is only one cluster.
dcluster = {}            
# Loop through each population 
for popIndex in [0,1]:  
    # dictionnary for a given population. There are two populations
    dpop = {}            
    # Loop through each individual
    for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):     
            # A single list to define an individual
        dpop[IndIndex] = [IndIndex*2, IndIndex*2 + 1]
    dcluster[popIndex] = dpop

struct = {0: dcluster}

### Define the population structure ###
Structure = egglib.stats.make_structure(struct, None) 

### Configurate the 'ComputeStats' object ###
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj') # Jost's D

### Isolate a SNP ###
vcf.next()
site = egglib.stats.site_from_vcf(vcf)

### Calculate Jost's D ###
cs.process_site(site, struct=Structure)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/egglib/stats/_cstats.py", line 431, in process_site
    self._frq.process_site(site, struct=struct)
  File "/Library/Python/2.7/site-packages/egglib/stats/_freq.py", line 159, in process_site
    if sum(struct) != site._obj.get_ning(): raise ValueError, 'invalid structure (sample size is required to match)'
ValueError: invalid structure (sample size is required to match)

文档表明 here

[The Structure object] is a tuple containing two items, each being a dict. The first one represents the ingroup and the second represents the outgroup.

The ingroup dictionary is itself a dictionary holding more dictionaries, one for each cluster of populations. Each cluster dictionary is a dictionary of populations, populations being themselves represented by a dictionary. A population dictionary is, again, a dictionary of individuals. Fortunately, individuals are represented by lists.

An individual list contains the index of all samples belonging to this individual. For haploid data, individuals will be one-item lists. In other cases, all individual lists are required to have the same number of items (consistent ploidy). Note that, if the ploidy is more than one, nothing enforces that samples of a given individual are grouped within the original data.

The keys of the ingroup dictionary are the labels identifying each cluster. Within a cluster dictionary, the keys are population labels. Finally, within a population dictionary, the keys are individual labels.

The second dictionary represents the outgroup. Its structure is simpler: it has individual labels as keys, and lists of corresponding sample indexes as values. The outgroup dictionary is similar to any ingroup population dictionary. The ploidy is required to match over all ingroup and outgroup individuals.

但我无法理解它。提供的示例是fasta格式的,我不明白将逻辑扩展到VCF格式。

有两个错误

第一个错误

函数 make_structure returns Structure 对象但不将其保存在 stats 中。因此,您必须保存此输出并在函数 process_site.

中使用它
Structure = egglib.stats.make_structure(struct, None) 

第二个错误

结构对象必须指定单倍体。因此,将字典创建为

dcluster = {}            
for popIndex in [0,1]:  
    dpop = {}            
    for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):     
        dpop[IndIndex] = [IndIndex]
    dcluster[popIndex] = dpop

struct = {0: dcluster}