Fill column with unite() using mutate() and case_when() in R (tidyverse)
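I have a list of names and assignment thresholds for those names, which I use to decide whether each name was assigned appropriately.
You can recreate a test dataset with this: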
df <- data.frame(level1 = c("Eukaryota","Eukaryota","Eukaryota","Eukaryota","Eukaryota"),
                 level2 = c("Opisthokonta","Alveolata","Opisthokonta","Alveolata","Alveolata"),
                 level3 = c("Fungi","Ciliophora","Fungi","Ciliophora","Dinoflagellata"),
                 level4 = c("Basidiomycota","Spirotrichea","Basidiomycota","Spirotrichea","Dinophyceae"),
                 value = c("100;5;4;2", "100;100;100;100", "100;80;60;50", "90;50;40;40","100;80;20;0"))
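I want to use tidyverse mutate() and case_when() to find the taxonomic level that passes the appropriate threshold. The tidyverse statement below splits the thresholds apart and then attempts to do this.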
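My bottlenecks:
- Using case_when() versus ifelse() statements. Would ifelse() be more appropriate?
- I don't know how to fill the new column, Name_updated, with the concatenated level1-levelX names. As written, unite() is not appropriate because it operates on the whole dataset. In reality I have many more columns, so doing this without the tidyverse level1:level3 syntax would be painful!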
df_updated <- df %>%
  separate(value, c("threshold1","threshold2", "threshold3", "threshold4"), sep =";") %>%
  mutate(Name_updated = case_when(
    threshold4 >= 50 ~ unite(level1:level4, sep = ";"), #Fill with all taxonomic names to level4
    threshold4 < 50 & threshold3 >= 60 ~ unite(level1:level3, sep = ";"), #If last threshold is <50, only fill with taxonomic names to level3
    threshold4 < 50 & threshold3 < 60 & threshold2 >= 50 ~ unite(level1:level2, sep = ";"), #If thresholds for level 3 and 4 are below, fill only level1;level2
    TRUE ~ level1)) %>% #Otherwise fill with only level 1
  data.frame
Desired output
> df_updated$Name_updated
# Output of this new list:
Eukaryota
Eukaryota;Alveolata;Ciliophora;Spirotrichea
Eukaryota;Opisthokonta;Fungi;Basidiomycota
Eukaryota;Alveolata
Eukaryota;Alveolata
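The desired next step is to write a function that lets the user specify the thresholds used in the script, so I really need the logic for determining which thresholds pass to be robust.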
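The problem is with unite(), and also with the type of the columns created by separate(). By default convert = FALSE, so the separated columns are of class character. With convert = TRUE the thresholds are converted to numeric columns, and the names can be pasted together from the selected level columns instead of calling unite() inside case_when().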
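To illustrate why the character columns matter, here is a minimal sketch. The vector names `thr_chr` and `thr_num` are made up for this example. When a character vector is compared with a number, R coerces the number to a string and compares the strings, so "100" sorts before "50".

```r
# Thresholds as separate() would return them with convert = FALSE
thr_chr <- c("100", "5", "60")
# The same thresholds after conversion to integer
thr_num <- as.integer(thr_chr)

# 50 is coerced to "50" and the comparison is done on strings
chr_cmp <- thr_chr >= 50
# Numeric comparison, which is what the thresholds need
num_cmp <- thr_num >= 50

print(chr_cmp)  # FALSE FALSE  TRUE
print(num_cmp)  #  TRUE FALSE  TRUE
```

Passing convert = TRUE to separate() yields integer columns, so the comparisons behave numerically.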
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
  type.convert(as.is = TRUE) %>%
  separate(value, c("threshold1","threshold2",
                    "threshold3", "threshold4"), sep =";", convert = TRUE) %>%
  mutate(Name_updated =
    case_when(
      threshold4 >= 50 ~
        select(., starts_with('level')) %>%
          reduce(str_c, sep=";"),
      threshold4 < 50 & threshold3 >= 60 ~
        select(., level1:level3) %>%
          reduce(str_c, sep=";"),
      threshold4 < 50 & threshold3 < 60 & threshold2 >= 50 ~
        select(., level1:level2) %>%
          reduce(str_c, sep=";"),
      TRUE ~ level1))
#  level1    level2       level3         level4        threshold1 threshold2 threshold3 threshold4
#1 Eukaryota Opisthokonta Fungi          Basidiomycota 100        5          4          2
#2 Eukaryota Alveolata    Ciliophora     Spirotrichea  100        100        100        100
#3 Eukaryota Opisthokonta Fungi          Basidiomycota 100        80         60         50
#4 Eukaryota Alveolata    Ciliophora     Spirotrichea  90         50         40         40
#5 Eukaryota Alveolata    Dinoflagellata Dinophyceae   100        80         20         0
#  Name_updated
#1 Eukaryota
#2 Eukaryota;Alveolata;Ciliophora;Spirotrichea
#3 Eukaryota;Opisthokonta;Fungi;Basidiomycota
#4 Eukaryota;Alveolata
#5 Eukaryota;Alveolata
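For the follow-up goal of letting the user specify the thresholds, one possible sketch is below. It is not part of the answer above: the function name `assign_names` and its `cutoffs` argument are hypothetical, and it assumes the columns are named level1..levelN with a `value` column holding N scores separated by ";". It keeps the names down to the deepest level whose score meets its cutoff, which mirrors the top-down order of the case_when() conditions, and it always keeps level1 as the fallback.

```r
library(tidyr)

# Hypothetical helper (an assumption, not from the original answer).
# `cutoffs`: one minimum score per level, level1 first.
assign_names <- function(df, cutoffs, sep = ";") {
  n <- length(cutoffs)
  level_cols <- paste0("level", seq_len(n))
  thr_cols <- paste0("threshold", seq_len(n))

  # Split "100;5;4;2" into integer threshold columns
  out <- separate(df, value, into = thr_cols, sep = ";", convert = TRUE)

  # TRUE where a score meets the cutoff for its level
  passes <- sweep(as.matrix(out[thr_cols]), 2, cutoffs, `>=`)
  passes[, 1] <- TRUE  # level1 is always kept as the fallback

  # Deepest passing level in each row
  depth <- apply(passes, 1, function(p) max(which(p)))

  # Paste level1..level<depth> together row by row
  levels_chr <- as.matrix(out[level_cols])
  out$Name_updated <- vapply(
    seq_len(nrow(out)),
    function(i) paste(levels_chr[i, seq_len(depth[i])], collapse = sep),
    character(1)
  )
  out
}

df <- data.frame(
  level1 = rep("Eukaryota", 5),
  level2 = c("Opisthokonta", "Alveolata", "Opisthokonta", "Alveolata", "Alveolata"),
  level3 = c("Fungi", "Ciliophora", "Fungi", "Ciliophora", "Dinoflagellata"),
  level4 = c("Basidiomycota", "Spirotrichea", "Basidiomycota", "Spirotrichea", "Dinophyceae"),
  value = c("100;5;4;2", "100;100;100;100", "100;80;60;50", "90;50;40;40", "100;80;20;0")
)

# Cutoffs 50/60/50 for levels 2-4, as in the question; the level1 cutoff is unused
res <- assign_names(df, cutoffs = c(0, 50, 60, 50))
print(res$Name_updated)
```

With the test data and these cutoffs, this should reproduce the Name_updated column shown above. Because the level and threshold column names are generated from the length of `cutoffs`, the same function would extend to more levels without writing one case_when() branch per level.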