Extracting Gene Annotation IDs in R

Question

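I have an annotation file from which I want to parse the FlyBase transcript IDs into a new column. I tried a regular expression, but it did not work, and I am not sure I was using it correctly. Each annotation string is a collection of IDs from different databases, and a FlyBase ID can appear at the start or in the middle of it. There can also be several FlyBase IDs in one string, in which case I want them joined with a delimiter, like ID1/ID2.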

Example annotation rows: "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0"

"FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"

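I want to create a column that keeps the IDs in their original order but contains only the FlyBase IDs, separated by the delimiter where necessary. I am using the data.table package, so I would appreciate a data.table solution if there is one. One idea I had was to use sub with the pattern [FBtr][0-9+] (not sure whether that is correct) and replace anything that does not match the pattern with "".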

Example table: x <- data.table(probesetID = 1:10, probesetType = rep("main", 10), rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10))

Answer 1

Here is something to get you started; I can update the answer once I have a better idea of what your data.table looks like:

x <- "FBtr0079338 // --- // 100 // 15 // 15 // 0 /// FBtr0086326 // --- // 100 // 15 // 15 // 0 /// FBtr0100846 // --- // 100 // 15 // 15 // 0 /// NONDMET000145 // --- // 100 // 15 // 15 // 0 /// NONDMET000970 // --- // 100 // 15 // 15 // 0 /// NONDMET000971 // --- // 100 // 15 // 15 // 0"
sapply(strsplit(x, "/+"), function(s) grep("FBtr", trimws(s), value=TRUE))

#     [,1]         
#[1,] "FBtr0079338"
#[2,] "FBtr0086326"
#[3,] "FBtr0100846"

sapply(strsplit(x, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))
#[1] "FBtr0079338;FBtr0086326;FBtr0100846"

Edit:

To assign the result to a new column in the data.table:

x$FBtr <- sapply(strsplit(x$V3, "/+"), function(x) paste0(grep("FBtr", trimws(x), value=TRUE), collapse = ";"))

Essentially, you supply the column containing the annotations in place of the string x used above.
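As a rough sketch of how this could look with data.table's by-reference assignment, the snippet below rebuilds the example table from the question and adds the column with `:=`. It assumes the annotation column is auto-named V3 (as in the line above) and switches the separator to "/", the delimiter the question asks for; the column name FBtr is just the one used above.

```r
library(data.table)

# Example table from the question; the unnamed third column is auto-named V3.
x <- data.table(
  probesetID = 1:10,
  probesetType = rep("main", 10),
  rep("FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0", 10)
)

# Split each annotation on runs of slashes, keep only the fields that
# contain "FBtr", and join them with "/" in their original order.
x[, FBtr := sapply(strsplit(V3, "/+"), function(s) {
  paste0(grep("FBtr", trimws(s), value = TRUE), collapse = "/")
})]

x$FBtr[1]
# [1] "FBtr0299871/FBtr193920"
```

Using `:=` adds the column in place, which avoids the copy that `x$FBtr <- ...` may trigger on a data.table.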

Answer 2

More specific to data.table, and using the stringr package:

library(stringr)
x[, .(IDs = str_c(unlist(str_extract_all(V3, "(FBtr)[0-9]+")), 
    collapse = "/")), by = probesetID]
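For comparison, the same extraction can be sketched in base R with no extra packages, using `gregexpr` and `regmatches`. Note that the pattern is the literal prefix FBtr followed by `[0-9]+`; a character class such as `[FBtr]` would match only a single character out of F, B, t, r. The vector `ann` below is an illustrative input assembled from strings in the question, with the last element chosen to show a row that has no FlyBase ID.

```r
# Illustrative input: two annotation strings from the question, plus one
# segment without any FlyBase ID to show the empty-result case.
ann <- c(
  "AY113634 // --- // 100 // 2 // 2 // 0 /// FBtr0089787 // --- // 100 // 2 // 2 // 0",
  "FBtr0299871 // --- // 100 // FBtr193920 // 3 // 3 // 0",
  "NONDMET000145 // --- // 100 // 15 // 15 // 0"
)

# gregexpr finds every match per string; regmatches returns them as a list
# with one character vector per input element, in order of appearance.
ids <- regmatches(ann, gregexpr("FBtr[0-9]+", ann))

# Collapse each vector of IDs into one "/"-separated string.
joined <- vapply(ids, paste, character(1), collapse = "/")

joined
# [1] "FBtr0089787"            "FBtr0299871/FBtr193920" ""
```

Rows without a FlyBase ID come back as an empty string here; whether to keep them as "" or convert them to NA depends on how the column is used downstream.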
