Splitting the values in a column using regex

Question

I have a data.frame with two columns, as shown below:

dat

    ID                             Details                         
    id_1        box1_homodomain gn=box1 os=homo sapiens p=4 se=1   
    id_2        sox2_plurinet gn=plu os=mus musculus p=5 se=3

I want to split "os=xxx" and "gn=yyy" out of the "Details" column for all IDs, and print the result as follows:

    Id   Description        gn      os               
   Id_1  box1_homodomain    box1    homo sapiens
   Id_2  sox2_plurinet      plu     mouse musculus

I tried using gsub in R, but I could not split os=homo sapiens and gn=box1 into their own columns. This is the R code I used:

dat$gn=gsub('^[gn=][A-z][A-z]`,dat$Details)
dat$os=gsub('^[os=][A-z][A-z]`,dat$Details)

Can anyone tell me where I went wrong and how to fix it? Please help.

Thanks in advance.

Answer 1

Here is an option with tidyr:

library(tidyr)
# specify the new column names:
vars <- c("Description", "gn", "os")
# then separate the "Details" column according to regex and drop extra columns:
separate(dat, Details, into = vars, sep = "[A-Za-z]+=", extra = "drop")
#    ID      Description    gn            os
#1 id_1 box1_homodomain  box1  homo sapiens 
#2 id_2   sox2_plurinet   plu  mus musculus
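
Because the separator pattern only consumes the "key=" tokens, the space in front of each key stays attached to the preceding value (visible as trailing spaces in the output above). The following base R sketch is not part of the tidyr answer; it only illustrates what the same pattern splits on and how trimws can tidy the pieces. The names details, pieces and fields are made up for the illustration.

```r
# Base R illustration of the split pattern used above (not the tidyr call itself).
details <- c("box1_homodomain gn=box1 os=homo sapiens p=4 se=1",
             "sox2_plurinet gn=plu os=mus musculus p=5 se=3")

# Split on every "key=" token; each piece keeps the space that preceded the key.
pieces <- strsplit(details, "[A-Za-z]+=")

# Keep the first three pieces per row and strip the surrounding whitespace.
fields <- t(sapply(pieces, function(p) trimws(p[1:3])))
colnames(fields) <- c("Description", "gn", "os")
fields
```

The same trimming could presumably be applied to the columns returned by separate if the trailing spaces matter downstream.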

Answer 2

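1) sub and gsub. To use sub and gsub as in the question, try this. Note that each regular expression should match the whole of dat$Details, so that when the match is replaced with the capture group, only the capture group remains. For dat$GO (from the comments on the question), we remove everything up to but not including the first P:, replace all occurrences of ;P: with a comma, and then remove P: as well as the semicolon and everything after it. F and C are handled the same way: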

data.frame(dat[1], 
   Description = sub(" .*", "", dat$Details),
   gn = sub(".*gn=(.*) os=.*", "\\1", dat$Details),
   os = sub(".*os=(.*) p=.*", "\\1", dat$Details),
   P = gsub("P:|;.*", "", gsub(";P:", ",", sub(".*?P:", "P:", dat$GO))),
   F = gsub("F:|;.*", "", gsub(";F:", ",", sub(".*?F:", "F:", dat$GO))),
   C = gsub("C:|;.*", "", gsub(";C:", ",", sub(".*?C:", "C:", dat$GO))))

giving:

    ID     Description   gn           os       P       F       C
1 id_1 box1_homodomain box1 homo sapiens p_1,p_2     F_1 C_1,C_2
2 id_2   sox2_plurinet  plu mus musculus     p_1 F_1,F_2     C_1
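
To see why each pattern in 1) has to cover the whole string, here is a minimal sketch comparing a full-string pattern with one that matches only the gn=... token. The variable names x, full and partial are made up for the illustration.

```r
x <- "box1_homodomain gn=box1 os=homo sapiens p=4 se=1"

# The pattern covers the whole string, so only the captured group is left.
full <- sub(".*gn=(.*) os=.*", "\\1", x)

# The pattern covers only "gn=box1", so the text on both sides is kept.
partial <- sub("gn=([^ ]*)", "\\1", x)

full     # "box1"
partial  # "box1_homodomain box1 os=homo sapiens p=4 se=1"
```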

2) read.pattern. Use read.pattern in the gsubfn package, which lets one define a single regular expression whose capture groups represent the fields of interest. This makes the processing of dat$Details somewhat easier. Processing of dat$GO can be simplified too, by extracting the P:... fields using strapplyc and then joining them with paste (and similarly for the F and C fields):

library(gsubfn)

Sub <- function(string, pat) sapply(strapplyc(string, pat), paste, collapse = ",")

DF <- read.pattern(text = as.character(dat$Details), 
        pattern = "(.*) gn=(.*) os=(.*) p=",
        col.names = c("Description", "gn", "os"),
        as.is = TRUE)

cbind(dat[1], DF,
      P = Sub(dat$GO, "P:(.*?);"),
      F = Sub(dat$GO, "F:(.*?);"),
      C = Sub(dat$GO, "C:(.*?);"))

giving:

    ID     Description   gn           os       P       F       C
1 id_1 box1_homodomain box1 homo sapiens p_1,p_2     F_1 C_1,C_2
2 id_2   sox2_plurinet  plu mus musculus     p_1 F_1,F_2     C_1
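
If gsubfn is not available, roughly the same GO extraction can be sketched in base R with gregexpr and regmatches. This is an alternative illustration, not what the answer's code does; collect_field is a hypothetical helper name, and the go vector repeats the sample dat$GO values.

```r
go <- c("P:p_1;P:p_2;F:F_1;C:C_1;C:C_2;  ",
        "P:p_1;F:F_1;F:F_2;C:C_1;")

# Hypothetical helper: find every "key:value" run up to the next semicolon,
# drop the "key:" prefix, and join the values with commas.
collect_field <- function(x, key) {
  hits <- regmatches(x, gregexpr(paste0(key, ":[^;]*"), x))
  sapply(hits, function(h) {
    paste(sub(paste0("^", key, ":"), "", h), collapse = ",")
  })
}

collect_field(go, "P")
collect_field(go, "F")
collect_field(go, "C")
```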

Here is a visualization of the regular expression used in read.pattern:

(.*) gn=(.*) os=(.*) p=

Debuggex Demo
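
The same single pattern can also be tried out in base R, which may help when checking what each capture group picks up. This is a sketch using regexec and regmatches rather than read.pattern; the names details, m, fields and DF2 are made up for the illustration.

```r
details <- c("box1_homodomain gn=box1 os=homo sapiens p=4 se=1",
             "sox2_plurinet gn=plu os=mus musculus p=5 se=3")

# For each string, regmatches returns the full match followed by the three groups.
m <- regmatches(details, regexec("(.*) gn=(.*) os=(.*) p=", details))

# Drop the full match and bind the capture groups into one row per input string.
fields <- do.call(rbind, lapply(m, function(v) v[-1]))
DF2 <- data.frame(fields, stringsAsFactors = FALSE)
names(DF2) <- c("Description", "gn", "os")
DF2
```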

Notes

1) If the dat$Details column is already character, we can omit as.character. If factor columns are acceptable in the result, we can also omit as.is = TRUE.

2) The sample output in the question has mouse, but the input has mus. We assume it should be mus in both cases.

3) We used this for dat:

dat <-
structure(list(ID = c("id_1", "id_2"), 
Details = c("box1_homodomain gn=box1 os=homo sapiens p=4 se=1", 
"sox2_plurinet gn=plu os=mus musculus p=5 se=3"), 
GO = c("P:p_1;P:p_2;F:F_1;C:C_1;C:C_2;  ", 
"P:p_1;F:F_1;F:F_2;C:C_1;")), .Names = c("ID", "Details", 
"GO"), class = "data.frame", row.names = c(NA, -2L))

In the future, please include the output of dput(dat) in your question.

Answer 3

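You can also use regex capture groups for this. The match for each capture group can be extracted with, for example, the stri_match_first_regex function from the stringi package.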

dat <- data.frame(
   ID=c("id_1", "id_2"),
   details=c("box1_homodomain gn=box1 os=homo sapiens p=4 se=1", "sox2_plurinet gn=plu os=mus musculus p=5 se=3")
)

library(stringi)
res <- stri_match_first_regex(dat$details, "^(.+) gn=(.+) os=(.+) p=.*$")
# as.character avoids storing integer codes (1, 2) when ID is a factor
res[,1] <- as.character(dat$ID)
res <- as.data.frame(res)
names(res) <- c("ID", "Description", "gn", "os")
res
##     ID     Description   gn           os
## 1 id_1 box1_homodomain box1 homo sapiens
## 2 id_2   sox2_plurinet  plu mus musculus
