Extracting values from a column using tidyr
I have a data.frame annot defined as:
annot <- structure(list(Name = c("dd_1", "dd_2", "dd_3","dd_4", "dd_5", "dd_6","dd_7"), GOs =
c("C:extracellular space; C:cell body; P:cell migration process; P:NF/ß pathway",
"C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation",
"C:cardiomyceltes; C:intracellular pace; F:putative; F:magnesium ion binding; F:calcium ion binding; P:visual perception; P:blood coagulation",
"F:poly(A) RNA binding; P:DNA-templated transcription, initiation",
"C:ULK1-ATG13-FIP200 complex; F:histone-arginine N-methyltransferase activity; P:single-organism cellular process",
"F:3'-5' DNA helicase activity; P:acetate-CoA ligase activity",
"F:UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate-D-alanyl-D-alanine ligase activity; P:oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor"
)), .Names = c("Name", "GOs"), class = "data.frame", row.names = c(NA,
-7L))
The data.frame looks like this:
Name GOs
dd_1 C:extracellular space; C:cell body; P:cell migration process; P:NF/ß pathway
dd_2 C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation
dd_3 C:cardiomyceltes; C:intracellular pace; F:putative; F:magnesium ion binding; F:calcium ion binding; P:visual perception; P:blood coagulation
dd_4 F:poly(A) RNA binding; P:DNA-templated transcription, initiation
dd_5 C:ULK1-ATG13-FIP200 complex; F:histone-arginine N-methyltransferase activity; P:single-organism cellular process
dd_6 F:3'-5' DNA helicase activity; P:acetate-CoA ligase activity
dd_7 F:UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate-D-alanyl-D-alanine ligase activity; P:oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor
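Each entry contains words, special characters, and alphanumeric characters under the prefixes C, F, and P. I want to split all values of the form C:xxx; F:yyy; P:zzz
into separate columns, one per prefix, so that the result looks like this: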
Name Component Function P
dd_1 C:extracellular space;C:cell body F:transport carrier P:cell migration process;P:NF/ß pathway
dd_2 C:Signal transduction;C:nucleus F:positive regulation P:single organism;P:positive regulation
dd_3 C:cardiomyceltes;C:intracellular pace F:magnesium ion binding;F:calcium ion binding P:visual perception;P:blood coagulation
dd_4 F:poly(A) RNA binding; P:DNA-templated transcription, initiation
dd_5 C:ULK1-ATG13-FIP200 complex F:histone-arginine N-methyltransferase activity P:single-organism cellular process
dd_6 F:3'-5' DNA helicase activity; P:acetate-CoA ligase activity
dd_7 F:UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate-D-alanyl-D-alanine ligase activity P:oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor
I tried running the following tidyr command in R:
separate(annot, GOs, into = c("P", "F", "C"), sep = "[a-z]+=")
but it returned the following error:
Error: Values not split into 3 pieces at 1, 2, 3,4
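The error can probably be traced to the separator: the regex "[a-z]+=" needs an "=" character, and none of the GOs strings contain one, so each value stays in one piece instead of three. A minimal base R sketch of that check, using two rows copied from annot (the names gos, pieces_original and pieces_semicolon are only for illustration):

```r
# Two rows copied from annot$GOs above (dd_2 and dd_4)
gos <- c(
  "C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation",
  "F:poly(A) RNA binding; P:DNA-templated transcription, initiation"
)

# The separator from the question requires "=", which never occurs,
# so strsplit() returns every string as a single piece
pieces_original <- lengths(strsplit(gos, "[a-z]+="))

# Splitting on "; " does work, but the number of terms differs per row
pieces_semicolon <- lengths(strsplit(gos, "; ", fixed = TRUE))

print(pieces_original)
print(pieces_semicolon)
```

Because the number of terms varies from row to row, splitting into a fixed set of three columns in one step is unlikely to fit even with a better separator; the terms have to be grouped by prefix first.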
You could try strsplit:
res <- do.call(rbind.data.frame,lapply(strsplit(annot$GOs, ";"),
function(x) tapply(x, sub(':.*', '', x), FUN=paste, collapse=";")))
res1 <- data.frame(Name=annot[,1], setNames(res, c('Component',
'Function', 'P')), stringsAsFactors=FALSE)
res1
# Name Component
#1 dd_1 C:extracellular space;C:cell body
#2 dd_2 C:Signal transduction;C:nucleus
#3 dd_3 C:cardiomyceltes;C:intracellular pace
# Function
#1 F:transport carrier
#2 F:positive regulation
#3 F:putative;F:magnesium ion binding;F:calcium ion binding
# P
#1 P:cell migration process;P:NF/ß pathway
#2 P:single organism;P:positive regulation
#3 P:visual perception;P:blood coagulation
Or you could try extract from tidyr:
library(tidyr)
extract(annot, GOs, c('C', 'F', 'P'), '(C:[^F]+);(F:[^P]+);(P:.*)')
# Name C
#1 dd_1 C:extracellular space;C:cell body
#2 dd_2 C:Signal transduction;C:nucleus
#3 dd_3 C:cardiomyceltes;C:intracellular pace
# F
#1 F:transport carrier
#2 F:positive regulation
#3 F:putative;F:magnesium ion binding;F:calcium ion binding
# P
#1 P:cell migration process;P:NF/ß pathway
#2 P:single organism;P:positive regulation
#3 P:visual perception;P:blood coagulation
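One caveat about the pattern above: it expects all three groups to be present, joined by ";" with no space, and the [^F] and [^P] classes stop at any capital F or P inside a value. A small base R sketch with grepl on made-up strings (toy and matches are hypothetical names, not part of the data) shows which shapes the regex accepts; as far as I know, extract fills the new columns with NA for rows that do not match.

```r
# Same regex as in the extract() call above
pattern <- "(C:[^F]+);(F:[^P]+);(P:.*)"

# Made-up strings, each illustrating one situation
toy <- c(
  all_three   = "C:a;F:b;P:c",                          # matches
  with_spaces = "C:a; F:b; P:c",                        # "; " instead of ";"
  missing_C   = "F:b;P:c",                              # no C: group at all
  capital_F   = "C:ULK1-ATG13-FIP200 complex;F:b;P:c"   # capital F inside the C value
)

matches <- grepl(pattern, toy)
names(matches) <- names(toy)
print(matches)
```

Only the first string matches, so this approach fits data where every row has all of C, F and P in exactly that layout.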
Update
In the new dataset, each row is missing some of the elements (i.e. "C", "F", etc.). You could modify the first solution:
res <- do.call(rbind.data.frame,lapply(strsplit(annot$GOs, "; "),function(x){
x1 <- tapply(x, sub(':.*', '', x), FUN=paste, collapse=";")
x1[match(c('C', 'F', 'P'), names(x1))]}))
res1 <- data.frame(Name=annot[,1], setNames(res, c('Component',
'Function', 'P')), stringsAsFactors=FALSE)
head(res1,2)
# Name Component Function
#1 dd_1 C:extracellular space;C:cell body <NA>
#2 dd_2 C:Signal transduction;C:nucleus F:positive regulation
# P
#1 P:cell migration process;P:NF/ß pathway
#2 P:single organism;P:positive(+) regulation
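The key step in the modified solution is the match() call, which puts the grouped terms into a fixed C, F, P order and leaves NA where a prefix is absent. A minimal sketch on a single made-up row (x, x1 and aligned are illustrative names):

```r
# One row, already split on "; ", with no F term
x <- c("C:cell body", "P:cell migration", "P:pathway")

# Group the terms by the prefix before the first ":" and paste each group
x1 <- tapply(x, sub(":.*", "", x), FUN = paste, collapse = ";")

# match() returns the position of each wanted prefix in names(x1),
# or NA when the prefix is missing, so indexing yields NA in that slot
aligned <- x1[match(c("C", "F", "P"), names(x1))]

print(as.vector(aligned))
```

Since every row now yields exactly three slots in the same order, the rows can be safely combined with rbind.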
I think you're better off using a tidy format like this:
library(tidyr)
library(dplyr)
annot %>%
tbl_df() %>%
mutate(GOs = strsplit(GOs, "; ")) %>% # split each GO into a vector
unnest(GOs) %>% # unnest the vectors into multiple rows
separate(GOs, c("type", "value"), ":")
#> Source: local data frame [25 x 3]
#>
#> Name type value
#> 1 dd_1 C extracellular space
#> 2 dd_1 C cell body
#> 3 dd_1 P cell migration process
#> 4 dd_1 P NF/ß pathway
#> 5 dd_2 C Signal transduction
#> 6 dd_2 C nucleus
#> 7 dd_2 F positive regulation
#> 8 dd_2 P single organism
#> 9 dd_2 P positive(+) regulation
#> 10 dd_3 C cardiomyceltes
#> .. ... ... ...
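If the wide layout from the question is still needed afterwards, the long table can be folded back into one column per type. The following is only a base R sketch of that idea, using tapply rather than a tidyr function; demo, long and wide are hypothetical names, and demo holds just two rows copied from annot.

```r
# Two rows copied from annot (dd_2 and dd_4)
demo <- data.frame(
  Name = c("dd_2", "dd_4"),
  GOs = c(
    "C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation",
    "F:poly(A) RNA binding; P:DNA-templated transcription, initiation"
  ),
  stringsAsFactors = FALSE
)

# Long format: one row per term, as in the tidy answer above
parts <- strsplit(demo$GOs, "; ", fixed = TRUE)
long <- data.frame(
  Name = rep(demo$Name, lengths(parts)),
  term = unlist(parts),
  stringsAsFactors = FALSE
)
long$type  <- sub(":.*", "", long$term)      # text before the first ":"
long$value <- sub("^[^:]*:", "", long$term)  # text after the first ":"

# Back to wide: paste the values of each Name/type pair;
# combinations that do not occur come out as NA
wide <- tapply(
  long$value,
  list(long$Name, factor(long$type, levels = c("C", "F", "P"))),
  paste, collapse = ";"
)
print(wide)
```

The result is a character matrix with one row per Name and the columns C, F and P; wrap it in as.data.frame() if a data.frame is preferred.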