R: pulling out substrings from VCF files

Question

I have data from VCF (Variant Call Format) files that I would like to work with in R. The data typically looks like this:

0/1:127,38:165:99:255,0,255
0/0:127,0:127:99:0,255,255
1/1:0,127:127:99:255,255,0

The information I need to pull out is (for the first row):

0/1,
127, and
38

For clarity, the information I would retrieve from the second row:

0/0,
127, and 
0

And from the third row:

1/1,
0, and
127

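(The remaining information in each string is not of interest for now.)

Can this be done in R? I would greatly appreciate any feedback on this.

Thank you. S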

Answer 1

1) Replace the colons with commas, then read the result in with read.table, using the comma as the separator:

read.table(text = gsub(":", ",", L), sep = ",", as.is = TRUE)[1:3]

giving:

   V1  V2  V3
1 0/1 127  38
2 0/0 127   0
3 1/1   0 127
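As a self-contained sketch of the same idea, the snippet below stores the three sample rows as a character vector and names the resulting columns. The column names GT, ref_depth and alt_depth are my own labels, chosen on the assumption that the FORMAT string is something like GT:AD:DP:GQ:PL; they are not given in the question.

```r
# Sample rows from the question, one string per record.
L <- c("0/1:127,38:165:99:255,0,255",
       "0/0:127,0:127:99:0,255,255",
       "1/1:0,127:127:99:255,255,0")

# Turn every ":" into "," so each record has a single delimiter.
flat <- gsub(":", ",", L, fixed = TRUE)

# Parse as comma-separated text and keep the first three fields only.
df <- read.table(text = flat, sep = ",", as.is = TRUE)[1:3]

# Hypothetical column names (assumes the fields are genotype and allelic depths).
names(df) <- c("GT", "ref_depth", "alt_depth")

print(df)
```

With as.is = TRUE the genotype column stays a character vector rather than becoming a factor, and the two depth columns are read as integers.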

2) Another option is read.pattern in the gsubfn package:

library(gsubfn)

read.pattern(text = L, pattern = "^(.*?):(.*?),(.*?):", as.is = TRUE)

giving the same result. The regular expression is shown below. Each ? after .* makes that .* match the shortest possible string rather than the longest:

^(.*?):(.*?),(.*?):
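To see what the lazy quantifier changes, here is a small base R sketch (no gsubfn needed) run on the first sample row. It uses perl = TRUE so that .*? is interpreted as a non-greedy match; the variable names are only for illustration.

```r
x <- "0/1:127,38:165:99:255,0,255"

# Lazy: (.*?) stops at the first colon.
lazy <- sub("^(.*?):.*$", "\\1", x, perl = TRUE)

# Greedy: (.*) runs on to the last colon.
greedy <- sub("^(.*):.*$", "\\1", x, perl = TRUE)

print(lazy)    # "0/1"
print(greedy)  # "0/1:127,38:165:99"

# The full pattern from the answer, applied with regexec/regmatches.
pattern <- "^(.*?):(.*?),(.*?):"
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# m[1] is the whole match; the capture groups follow.
fields <- m[-1]
print(fields)  # "0/1" "127" "38"
```

Without the ?, the first group would swallow everything up to the last colon, which is why the lazy form is needed here.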

Note: We used this input data:

L <- "0/1:127,38:165:99:255,0,255
0/0:127,0:127:99:0,255,255
1/1:0,127:127:99:255,255,0"

Answer 2

Another solution is to use VariantAnnotation; read the vignette and see ?readVcf, and be sure to use ScanVcfParam() to selectively read just those parts of the file you're interested in. Ask for more help on the Bioconductor support forum if this seems like a useful approach.
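A rough sketch of what that might look like is below. It assumes the Bioconductor package VariantAnnotation is installed and that the file declares GT and AD as FORMAT fields (a guess based on the sample rows in the question). The helper name read_gt_ad, the file name my_variants.vcf and the genome label "hg19" are placeholders, not anything taken from the question.

```r
# Assumes VariantAnnotation (Bioconductor) is installed.
suppressPackageStartupMessages(library(VariantAnnotation))

# Hypothetical helper: read only the GT and AD genotype fields from a VCF file.
read_gt_ad <- function(path, genome = "hg19") {
  # info = NA skips the INFO columns; geno lists the FORMAT fields to keep.
  param <- ScanVcfParam(info = NA, geno = c("GT", "AD"))
  vcf <- readVcf(path, genome = genome, param = param)

  # geno() returns one variants-by-samples matrix per requested field.
  list(GT = geno(vcf)$GT, AD = geno(vcf)$AD)
}

# Usage (placeholder file name):
# res <- read_gt_ad("my_variants.vcf")
# res$GT[1:3, , drop = FALSE]
```

This avoids string parsing altogether, at the cost of a Bioconductor dependency. How AD comes back (for example as a matrix whose cells hold several depths) depends on how the field is declared in the file header, so check the result on your own data.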
