Split a dataframe column containing delimited strings into multiple columns and retain specific portions of the split strings
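I have a dataframe df with a column GO. Each row of df contains one or more terms (separated by ;), and each term has a specific format: it starts with P, C, or F, followed by : and then the actual term.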
df <- data.frame(
GO = c("C:mitochondrion; C:kinetoplast", "", "F:calmodulin binding; C:cytoplasm; C:axoneme",
"", "P:cilium movement; P:inner dynein arm assembly; C:axoneme", "", "F:calcium ion binding"))
GO
1 C:mitochondrion; C:kinetoplast
2
3 F:calmodulin binding; C:cytoplasm; C:axoneme
4
5 P:cilium movement; P:inner dynein arm assembly; C:axoneme
6
7 F:calcium ion binding
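I want to split this column into three columns, BP, CC, and MF, depending on whether the term starts with P, C, or F, respectively. I also want these three columns to contain only the terms, without the identifiers (P, C, F, and :).
This is what I want the new dataframe to look like: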
  BP                                          CC                          MF
1                                             mitochondrion; kinetoplast
2
3                                             cytoplasm; axoneme          calmodulin binding
4
5 cilium movement; inner dynein arm assembly  axoneme
6
7                                                                         calcium ion binding
A tidyverse approach to achieve your desired result may look like so:
library(tidyr)
library(dplyr)
df %>%
mutate(id = seq(nrow(.))) %>%
separate_rows(GO, sep = ";\\s") %>%
separate(GO, into = c("category", "item"), sep = ":") %>%
mutate(category = recode(category, C = "CC", P = "BP", F = "MF", .default = "foo")) %>%
replace_na(list(item = "")) %>%
group_by(id, category) %>%
summarise(items = paste(item, collapse = "; "), .groups = "drop") %>%
pivot_wider(names_from = category, values_from = items, values_fill = "") %>%
select(BP, CC, MF)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [3, 7,
#> 11].
#> # A tibble: 7 × 3
#> BP CC MF
#> <chr> <chr> <chr>
#> 1 "" "mitochondrion; kinetoplas… ""
#> 2 "" "" ""
#> 3 "" "cytoplasm; axoneme" "cal…
#> 4 "" "" ""
#> 5 "cilium movement; inner dynein arm assembly" "axoneme" ""
#> 6 "" "" ""
#> 7 "" "" "cal…
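The warning above is triggered by the three empty strings: they contain no ":", so separate() cannot produce two pieces for them. A minimal sketch, assuming silent padding with NA is acceptable, is to pass fill = "right" to separate(). The object name pieces is only illustrative and not part of the answer.

```r
library(tidyr)
library(dplyr)

df <- data.frame(
  GO = c("C:mitochondrion; C:kinetoplast", "", "F:calmodulin binding; C:cytoplasm; C:axoneme",
         "", "P:cilium movement; P:inner dynein arm assembly; C:axoneme", "", "F:calcium ion binding"))

# `pieces` is a hypothetical name used only for this sketch
pieces <- df %>%
  mutate(id = row_number()) %>%
  separate_rows(GO, sep = ";\\s") %>%
  # fill = "right" pads the missing `item` with NA instead of warning
  separate(GO, into = c("category", "item"), sep = ":", fill = "right")

pieces
```

The rest of the pipeline (recode, replace_na, group_by, summarise, pivot_wider, select) should work unchanged on this intermediate result, since the empty rows still end up with an NA item.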
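Here is another one:
- Create an identifier with row_number
- Use separate_rows to put each item on its own row
- Use str_detect inside case_when to prepare the column names
- Remove the start of each item, e.g. 'C:', 'F:' and 'P:'
- Group and collapse into one row
- Get the distinct values and remove NA
- Apply pivot_wider and select the columns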
library(tidyverse)
df %>%
mutate(row = row_number()) %>%
separate_rows(GO, sep = '; ') %>%
mutate(names = case_when(str_detect(GO, 'C:')~"CC",
str_detect(GO, 'F:')~"MF",
str_detect(GO, 'P:')~"BP",
TRUE ~ NA_character_)) %>%
mutate(GO = str_replace_all(GO, '^[PCF]:', '')) %>%
group_by(row, names) %>%
mutate(b_x = paste(GO, collapse = "; ")) %>%
distinct(b_x) %>%
na.omit() %>%
pivot_wider(
names_from = names,
values_from = b_x
) %>%
ungroup() %>%
select(BP, CC, MF)
BP CC MF
<chr> <chr> <chr>
1 NA mitochondrion; kinetoplast NA
2 NA cytoplasm; axoneme calmodulin binding
3 cilium movement; inner dynein arm assembly axoneme NA
4 NA NA calcium ion binding
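Note that the result above has only four rows, because the empty inputs (rows 2, 4 and 6) are dropped along the way, whereas the question asks for seven. A possible variation, sketched under the assumption that empty strings are the desired fill, keeps the row identifier through pivot_wider and restores the missing rows with tidyr::complete(). The name res and the anchored patterns are choices made for this sketch, not part of the answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(
  GO = c("C:mitochondrion; C:kinetoplast", "", "F:calmodulin binding; C:cytoplasm; C:axoneme",
         "", "P:cilium movement; P:inner dynein arm assembly; C:axoneme", "", "F:calcium ion binding"))

# `res` is a hypothetical name used only for this sketch
res <- df %>%
  mutate(row = row_number()) %>%
  separate_rows(GO, sep = "; ") %>%
  mutate(names = case_when(str_detect(GO, "^C:") ~ "CC",
                           str_detect(GO, "^F:") ~ "MF",
                           str_detect(GO, "^P:") ~ "BP",
                           TRUE ~ NA_character_)) %>%
  filter(!is.na(names)) %>%                      # empty inputs have no prefix
  mutate(GO = str_remove(GO, "^[PCF]:")) %>%     # strip the identifier
  group_by(row, names) %>%
  summarise(b_x = paste(GO, collapse = "; "), .groups = "drop") %>%
  pivot_wider(names_from = names, values_from = b_x) %>%
  complete(row = seq_len(nrow(df))) %>%          # re-add rows 2, 4 and 6
  mutate(across(c(BP, CC, MF), ~ replace_na(.x, ""))) %>%
  select(BP, CC, MF)

res
```

Swapping distinct() and na.omit() for summarise() and filter() is only a simplification; the grouping and collapsing logic is the same as in the steps listed above.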
Another possible solution:
library(tidyverse)
df %>%
rownames_to_column("id") %>%
separate_rows(GO, sep = "; ") %>%
separate(GO, into = c("name", "value"), sep = ":", fill = "right") %>%
filter(complete.cases(.)) %>%
pivot_wider(id_cols = id, values_fn = list) %>% rowwise %>%
mutate(across(-id, ~ str_c(.x, collapse = "; "))) %>%
left_join(data.frame(id = seq(nrow(df)) %>% as.character), .) %>%
mutate(across(everything(), replace_na, "")) %>%
select(BP = P, CC = C, MF = F)
#> Joining, by = "id"
#> BP CC
#> 1 mitochondrion; kinetoplast
#> 2
#> 3 cytoplasm; axoneme
#> 4
#> 5 cilium movement; inner dynein arm assembly axoneme
#> 6
#> 7
#> MF
#> 1
#> 2
#> 3 calmodulin binding
#> 4
#> 5
#> 6
#> 7 calcium ion binding