将 VCF 文件的信息解析为 R 数据帧
Parse VCF file's INFO to an R dataframe
我正在尝试从 vcf 文件创建一个数据帧,其中仅包含 INFO 字段中的一些元素。问题是这些元素的值并不总是在相同的位置,所以当我加载 VCF 和拆分 INFO 字段时,我在不同的列中得到这些特定元素。
例如:
Pos Score Strand Length
CIPOS=0 SCORE=1 STRAND=+ LEN=634
SCORE=89 STRAND=- LEN=567 UTR=+
CIPOS=9 SCORE=1 STRAND=+ LEN=0
CIPOS=8 SCORE=1 STRAND=+ LEN=1
STRAND=+ LEN=555 UTR=+ B
如你所见,有些行被移动了,因为vcf中没有符号表示缺少某些INFO元素,并且字段信息是作为字符串读取的,所以拆分时我不知道如何告诉R在每一列的对应行写一个NA。
有什么方法可以在 Score 列中写入每个 "SCORE=" 值,在 Strand 列中写入每个 "STRAND=" 值,等等?
提前致谢!
我不确定这是否是最直接的方法,但它应该有效。
首先,尝试使用 dput(yourdata)
post 您的数据。这样人们就可以更轻松地使用您的具体示例。
我假设每一行由一个"sample"组成,所以诀窍是先添加样本标识符,然后将其放入长格式,然后再转回宽格式。
数据:
a <- data.frame(Pos = c("CIPOS=0",
"SCORE=89",
"CIPOS=9",
"CIPOS=8",
"STRAND=+"),
Score = c("SCORE=1",
"STRAND=-",
"SCORE=1",
"SCORE=1",
"LEN=555"),
Strand = c("STRAND=+",
"LEN=567",
"STRAND=+",
"STRAND=+",
"UTR=+")
)
dput(a)
#> structure(list(Pos = structure(c(1L, 4L, 3L, 2L, 5L), .Label = c("CIPOS=0",
#> "CIPOS=8", "CIPOS=9", "SCORE=89", "STRAND=+"), class = "factor"),
#> Score = structure(c(2L, 3L, 2L, 2L, 1L), .Label = c("LEN=555",
#> "SCORE=1", "STRAND=-"), class = "factor"), Strand = structure(c(2L,
#> 1L, 2L, 2L, 3L), .Label = c("LEN=567", "STRAND=+", "UTR=+"
#> ), class = "factor")), class = "data.frame", row.names = c(NA,
#> -5L))
然后如果您 运行 此代码:
library(tidyverse)
a %>% mutate(sample = row_number()) %>%
pivot_longer(-sample) %>%
# here you have to unselect the name column, since this is actually wrong
select(-name) %>%
# separate the strings that contain the data
separate(value, into = c("group", "value"), sep = "=") %>%
# put it back to wide format
pivot_wider( names_from = group,
values_from = value)
#> # A tibble: 5 x 6
#> sample CIPOS SCORE STRAND LEN UTR
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 1 + <NA> <NA>
#> 2 2 <NA> 89 - 567 <NA>
#> 3 3 9 1 + <NA> <NA>
#> 4 4 8 1 + <NA> <NA>
#> 5 5 <NA> <NA> + 555 +
由 reprex package (v0.3.0)
于 2020-03-04 创建
有用于此目的的包,例如来自 Bioconductor 的 VariantAnnotation。一旦你读入了 vcf 文件,信息就会被打包到一个 data.frame 中,你可以像下面这样评估它:
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
info(vcf)
DataFrame with 10376 rows and 22 columns
LDAF AVGPOST RSQ ERATE THETA
<numeric> <numeric> <numeric> <numeric> <numeric>
rs7410291 0.3431 0.989 0.9856 0.002 5e-04
rs147922003 0.0091 0.9963 0.8398 5e-04 0.0011
rs114143073 0.0098 0.9891 0.5919 7e-04 8e-04
rs141778433 0.0062 0.995 0.6756 9e-04 3e-04
rs182170314 0.0041 0.9981 0.7909 7e-04 4e-04
... ... ... ... ... ...
rs187302552 9e-04 0.9992 0.5571 3e-04 0.0026
rs9628178 0.0727 0.9997 0.9983 3e-04 0.0011
rs5770892 0.3678 0.9868 0.9784 0.0021 7e-04
rs144055359 0.0011 0.9987 0.5323 5e-04 4e-04
rs114526001 0.0543 0.9622 0.7595 0.0021 5e-04
CIEND CIPOS END HOMLEN
<IntegerList> <IntegerList> <integer> <IntegerList>
rs7410291 NA,NA NA,NA NA
rs147922003 NA,NA NA,NA NA
rs114143073 NA,NA NA,NA NA
rs141778433 NA,NA NA,NA NA
rs182170314 NA,NA NA,NA NA
... ... ... ... ...
rs187302552 NA,NA NA,NA NA
rs9628178 NA,NA NA,NA NA
rs5770892 NA,NA NA,NA NA
rs144055359 NA,NA NA,NA NA
rs114526001 NA,NA NA,NA NA
HOMSEQ SVLEN SVTYPE AC
<CharacterList> <integer> <character> <IntegerList>
rs7410291 NA NA 751
rs147922003 NA NA 20
rs114143073 NA NA 20
rs141778433 NA NA 12
rs182170314 NA NA 8
... ... ... ... ...
rs187302552 NA NA 1
rs9628178 NA NA 158
rs5770892 NA NA 801
rs144055359 NA NA 1
rs114526001 NA NA 113
AN AA AF AMR_AF ASN_AF
<integer> <character> <numeric> <numeric> <numeric>
rs7410291 2184 N 0.34 0.2 0.19
rs147922003 2184 c 0.01 0.01 NA
rs114143073 2184 G 0.01 0.0028 0.02
rs141778433 2184 C 0.01 0.01 NA
rs182170314 2184 C 0.0037 0.01 NA
... ... ... ... ... ...
rs187302552 2184 a 5e-04 NA 0.0017
rs9628178 2184 a 0.07 0.03 0.01
rs5770892 2184 a 0.37 0.32 0.38
rs144055359 2184 g 5e-04 NA NA
rs114526001 2184 g 0.05 0.01 0.01
AFR_AF EUR_AF VT SNPSOURCE
<numeric> <numeric> <character> <CharacterList>
rs7410291 0.83 0.22 SNP LOWCOV
rs147922003 0.02 0.01 SNP LOWCOV
rs114143073 0.01 0.01 SNP LOWCOV
rs141778433 0.02 NA SNP LOWCOV
rs182170314 0.01 NA SNP LOWCOV
... ... ... ... ...
rs187302552 NA NA SNP LOWCOV
rs9628178 0.17 0.08 SNP LOWCOV
rs5770892 0.59 0.23 SNP LOWCOV
rs144055359 NA 0.0013 SNP LOWCOV
rs114526001 0.16 0.04 SNP LOWCOV
您可以转换为 data.frame 并评估列,在此示例中没有链信息,但如果您有链,它应该可以工作:
df = as.data.frame(info(vcf))
df$CIPOS
我正在尝试从 vcf 文件创建一个数据帧,其中仅包含 INFO 字段中的一些元素。问题是这些元素的值并不总是在相同的位置,所以当我加载 VCF 和拆分 INFO 字段时,我在不同的列中得到这些特定元素。
例如:
Pos Score Strand Length
CIPOS=0 SCORE=1 STRAND=+ LEN=634
SCORE=89 STRAND=- LEN=567 UTR=+
CIPOS=9 SCORE=1 STRAND=+ LEN=0
CIPOS=8 SCORE=1 STRAND=+ LEN=1
STRAND=+ LEN=555 UTR=+ B
如你所见,有些行被移动了,因为vcf中没有符号表示缺少某些INFO元素,并且字段信息是作为字符串读取的,所以拆分时我不知道如何告诉R在每一列的对应行写一个NA。
有什么方法可以在 Score 列中写入每个 "SCORE=" 值,在 Strand 列中写入每个 "STRAND=" 值,等等?
提前致谢!
我不确定这是否是最直接的方法,但它应该有效。
首先,尝试使用 dput(yourdata)
post 您的数据。这样人们就可以更轻松地使用您的具体示例。
我假设每一行由一个"sample"组成,所以诀窍是先添加样本标识符,然后将其放入长格式,然后再转回宽格式。
数据:
a <- data.frame(Pos = c("CIPOS=0",
"SCORE=89",
"CIPOS=9",
"CIPOS=8",
"STRAND=+"),
Score = c("SCORE=1",
"STRAND=-",
"SCORE=1",
"SCORE=1",
"LEN=555"),
Strand = c("STRAND=+",
"LEN=567",
"STRAND=+",
"STRAND=+",
"UTR=+")
)
dput(a)
#> structure(list(Pos = structure(c(1L, 4L, 3L, 2L, 5L), .Label = c("CIPOS=0",
#> "CIPOS=8", "CIPOS=9", "SCORE=89", "STRAND=+"), class = "factor"),
#> Score = structure(c(2L, 3L, 2L, 2L, 1L), .Label = c("LEN=555",
#> "SCORE=1", "STRAND=-"), class = "factor"), Strand = structure(c(2L,
#> 1L, 2L, 2L, 3L), .Label = c("LEN=567", "STRAND=+", "UTR=+"
#> ), class = "factor")), class = "data.frame", row.names = c(NA,
#> -5L))
然后如果您 运行 此代码:
library(tidyverse)
a %>% mutate(sample = row_number()) %>%
pivot_longer(-sample) %>%
# here you have to unselect the name column, since this is actually wrong
select(-name) %>%
# separate the strings that contain the data
separate(value, into = c("group", "value"), sep = "=") %>%
# put it back to wide format
pivot_wider( names_from = group,
values_from = value)
#> # A tibble: 5 x 6
#> sample CIPOS SCORE STRAND LEN UTR
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 1 + <NA> <NA>
#> 2 2 <NA> 89 - 567 <NA>
#> 3 3 9 1 + <NA> <NA>
#> 4 4 8 1 + <NA> <NA>
#> 5 5 <NA> <NA> + 555 +
由 reprex package (v0.3.0)
于 2020-03-04 创建有用于此目的的包,例如来自 Bioconductor 的 VariantAnnotation。一旦你读入了 vcf 文件,信息就会被打包到一个 data.frame 中,你可以像下面这样评估它:
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
info(vcf)
DataFrame with 10376 rows and 22 columns
LDAF AVGPOST RSQ ERATE THETA
<numeric> <numeric> <numeric> <numeric> <numeric>
rs7410291 0.3431 0.989 0.9856 0.002 5e-04
rs147922003 0.0091 0.9963 0.8398 5e-04 0.0011
rs114143073 0.0098 0.9891 0.5919 7e-04 8e-04
rs141778433 0.0062 0.995 0.6756 9e-04 3e-04
rs182170314 0.0041 0.9981 0.7909 7e-04 4e-04
... ... ... ... ... ...
rs187302552 9e-04 0.9992 0.5571 3e-04 0.0026
rs9628178 0.0727 0.9997 0.9983 3e-04 0.0011
rs5770892 0.3678 0.9868 0.9784 0.0021 7e-04
rs144055359 0.0011 0.9987 0.5323 5e-04 4e-04
rs114526001 0.0543 0.9622 0.7595 0.0021 5e-04
CIEND CIPOS END HOMLEN
<IntegerList> <IntegerList> <integer> <IntegerList>
rs7410291 NA,NA NA,NA NA
rs147922003 NA,NA NA,NA NA
rs114143073 NA,NA NA,NA NA
rs141778433 NA,NA NA,NA NA
rs182170314 NA,NA NA,NA NA
... ... ... ... ...
rs187302552 NA,NA NA,NA NA
rs9628178 NA,NA NA,NA NA
rs5770892 NA,NA NA,NA NA
rs144055359 NA,NA NA,NA NA
rs114526001 NA,NA NA,NA NA
HOMSEQ SVLEN SVTYPE AC
<CharacterList> <integer> <character> <IntegerList>
rs7410291 NA NA 751
rs147922003 NA NA 20
rs114143073 NA NA 20
rs141778433 NA NA 12
rs182170314 NA NA 8
... ... ... ... ...
rs187302552 NA NA 1
rs9628178 NA NA 158
rs5770892 NA NA 801
rs144055359 NA NA 1
rs114526001 NA NA 113
AN AA AF AMR_AF ASN_AF
<integer> <character> <numeric> <numeric> <numeric>
rs7410291 2184 N 0.34 0.2 0.19
rs147922003 2184 c 0.01 0.01 NA
rs114143073 2184 G 0.01 0.0028 0.02
rs141778433 2184 C 0.01 0.01 NA
rs182170314 2184 C 0.0037 0.01 NA
... ... ... ... ... ...
rs187302552 2184 a 5e-04 NA 0.0017
rs9628178 2184 a 0.07 0.03 0.01
rs5770892 2184 a 0.37 0.32 0.38
rs144055359 2184 g 5e-04 NA NA
rs114526001 2184 g 0.05 0.01 0.01
AFR_AF EUR_AF VT SNPSOURCE
<numeric> <numeric> <character> <CharacterList>
rs7410291 0.83 0.22 SNP LOWCOV
rs147922003 0.02 0.01 SNP LOWCOV
rs114143073 0.01 0.01 SNP LOWCOV
rs141778433 0.02 NA SNP LOWCOV
rs182170314 0.01 NA SNP LOWCOV
... ... ... ... ...
rs187302552 NA NA SNP LOWCOV
rs9628178 0.17 0.08 SNP LOWCOV
rs5770892 0.59 0.23 SNP LOWCOV
rs144055359 NA 0.0013 SNP LOWCOV
rs114526001 0.16 0.04 SNP LOWCOV
您可以转换为 data.frame 并评估列,在此示例中没有链信息,但如果您有链,它应该可以工作:
df = as.data.frame(info(vcf))
df$CIPOS