Convert .gprobs files from Impute2 to PLINK format
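I have some imputed .gprobs files (one per chromosome), produced by Impute2 and downloaded from dbGaP, and I need to convert them to PLINK's .bed format to run some analyses.
My .gprobs files look like this: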
--- rs371609562:61395:CTT:C 61395 CTT C 0 0.023 0.977 0 0.039 0.961 0 0.015 0.985 0 0.026 0.974 0 0 1 0 0 1 0 0 1
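Can anyone help me work out how to convert this kind of file to PLINK format, or tell me which files I need for the conversion?
P.S.: I know this question may not belong here, but I don't know where else to ask.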
It looks like you mean the Oxford format; see:
https://www.cog-genomics.org/plink/1.9/formats#gen
If that is right, plink can read it as described here:
https://www.cog-genomics.org/plink/1.9/input#oxford
In the same command, you can write the output in PLINK binary format:
plink --gen file.gen --sample file.sample --make-bed --out output_prefix
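Since the question mentions one .gprobs file per chromosome, a small Python helper can build one such command per file. This is only a sketch under a few assumptions: the file names (`chr1.gprobs`, `study.sample`) are hypothetical, the .sample file has to come from the dbGaP dataset itself, and `--oxford-single-chr` is, as far as I understand the PLINK 1.9 documentation, the flag for supplying the chromosome when the first column of the file is a placeholder such as `---`. Please check the input page linked above before relying on it.

```python
import shlex


def plink_gen_to_bed_cmd(gprobs, sample, chrom, out_prefix, threshold=None):
    """Build the plink argument list for one per-chromosome conversion.

    All file names passed in are placeholders chosen by the caller.
    """
    cmd = [
        "plink",
        "--gen", str(gprobs),
        "--sample", str(sample),
        # Assumption: the first column of the Impute2 output is "---",
        # so the chromosome code is supplied on the command line.
        "--oxford-single-chr", str(chrom),
        "--make-bed",
        "--out", str(out_prefix),
    ]
    if threshold is not None:
        # Optional: override the default hard-call threshold.
        cmd += ["--hard-call-threshold", str(threshold)]
    return cmd


if __name__ == "__main__":
    # Print the commands for chromosomes 1-22 without running them.
    for chrom in range(1, 23):
        cmd = plink_gen_to_bed_cmd(
            f"chr{chrom}.gprobs", "study.sample", chrom, f"chr{chrom}"
        )
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to inspect them; each list could then be passed to `subprocess.run(cmd, check=True)` once they look right.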
Note the following caveat about converting Oxford to PLINK:
Since the PLINK 1 binary format cannot represent genotype
probabilities, calls with uncertainty greater than 0.1 are normally
treated as missing, and the rest are treated as hard calls. You can
adjust this threshold by providing a numeric parameter to
--hard-call-threshold.
Alternatively, when --hard-call-threshold is given the 'random'
modifier, calls are independently randomized according to the
probabilities in the file. (This is not ideal; it would be better to
randomize in a haploblock-sensitive manner. But resampling a bunch of
times with this and generating an empirical distribution of some
statistic can still be more informative than applying a single
threshold and calculating that statistic once.)
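To make the two behaviours in that note concrete, here is a rough Python sketch of what thresholding and random sampling do to the probability triplets in a .gen/.gprobs line. It is not PLINK's code. In particular, I am assuming that "uncertainty" means one minus the largest of the three probabilities, and the function names are made up for illustration.

```python
import random


def parse_gen_line(line):
    """Split an Oxford .gen line into its leading fields and per-sample triplets."""
    fields = line.split()
    _first_col, variant_id, pos, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    # Each triplet is (P(AA), P(AB), P(BB)) for one sample.
    triplets = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return variant_id, int(pos), allele_a, allele_b, triplets


def hard_call(triplet, threshold=0.1):
    """Return 0, 1 or 2 (copies of allele B), or None for a missing call.

    Assumption: uncertainty = 1 - max(probabilities).
    """
    best = max(range(3), key=lambda g: triplet[g])
    uncertainty = 1.0 - triplet[best]
    return best if uncertainty <= threshold else None


def random_call(triplet, rng):
    """Draw one genotype according to the probabilities (the 'random' idea)."""
    if sum(triplet) <= 0:
        return None  # no information for this sample
    return rng.choices([0, 1, 2], weights=triplet, k=1)[0]


if __name__ == "__main__":
    line = ("--- rs371609562:61395:CTT:C 61395 CTT C 0 0.023 0.977 "
            "0 0.039 0.961 0 0.015 0.985 0 0.026 0.974 0 0 1 0 0 1 0 0 1")
    _, _, _, _, triplets = parse_gen_line(line)
    print([hard_call(t) for t in triplets])
    rng = random.Random(0)
    print([random_call(t, rng) for t in triplets])
```

With the line from the question, every sample's largest probability is above 0.9, so under this reading all seven samples would get a hard call of 2 (homozygous for the second allele, C) at the default threshold.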