Split strings of different lengths and paste into specific columns of a data frame based on a match
I have a vector of strings with varying numbers of fields, as in this example:
TX <- c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus")
I need to build a data frame that splits each string by taxonomic rank: "domain", "phylum", "class", "order", "family", "genus".
I tried:
taxon <- str_split(clade_names, "\\|", simplify = TRUE)
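It splits the strings perfectly, but it fills the data frame from left to right, and I need each value placed in the column for its taxonomic rank.
I think I need grepl to match "d_", "p_", "c_", "o_", "f_", "g_", but I don't know how to write it correctly.
Thank you very much for your help.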
Using data.table: split on "|", reshape from wide to long, split on "_" to get the taxonomic rank group, then reshape from long back to wide:
library(data.table)
TX <- c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus")
taxon <- data.table(x = TX)
taxon[, tstrsplit(x, "|", fixed = TRUE)
][, rn := seq_len(.N)
][, melt(.SD, id.var = "rn")
][, c("grp", "name") := tstrsplit(value, "_")
][!is.na(value), dcast(.SD, rn ~ grp, value.var = "value")]
# rn c d f g o p
# 1: 1 <NA> d_Bacteria <NA> g_Thermobaculum <NA> <NA>
# 2: 2 c_Acidobacteria subdivision d_Bacteria f_Vicinamibacteraceae g_Luteitalea <NA> p_Acidobacteria
# 3: 3 c_Acidobacteriia d_Bacteria f_Acidobacteriaceae g_Acidobacterium o_Acidobacteriales p_Acidobacteria
# 4: 4 c_Acidobacteriia d_Bacteria f_Acidobacteriaceae g_Candidatus Koribacter o_Acidobacteriales p_Acidobacteria
# 5: 5 c_Acidobacteriia d_Bacteria f_Acidobacteriaceae g_Granulicella o_Acidobacteriales p_Acidobacteria
# 6: 6 c_Acidobacteriia d_Bacteria f_Acidobacteriaceae g_Terriglobus o_Acidobacteriales p_Acidobacteria
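The result above keeps the prefixes in the values and orders the columns alphabetically. As a rough sketch only, assuming you want full rank names as column headers in taxonomic order, something like the following might work. The `ranks` lookup vector and the `long` and `wide` names are illustrative and not part of the answer above, and the example uses just the first three strings to stay short.

```r
library(data.table)

TX <- c("d_Bacteria|g_Thermobaculum",
        "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea",
        "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium")

# Hypothetical lookup: prefix letter -> rank name, in taxonomic order
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# One row per (string, field): split each string on the literal "|"
long <- data.table(rn = seq_along(TX), x = TX)[
  , .(value = unlist(strsplit(x, "|", fixed = TRUE))), by = rn]

# The first character is the rank prefix; everything after "X_" is the taxon name
long[, grp := substr(value, 1, 1)]
long[, name := substring(value, 3)]

# Long to wide: one column per prefix, holding the name without its prefix
wide <- dcast(long, rn ~ grp, value.var = "name")

# Reorder the prefix columns taxonomically, then rename them to full rank names
present <- intersect(names(ranks), names(wide))
setcolorder(wide, c("rn", present))
setnames(wide, present, unname(ranks[present]))

wide
```

Ranks missing from a string come out as NA in that row, and a rank that never occurs in the data simply has no column.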
Here is a tidyverse solution (which I guess you'll prefer, since you're already using str_split):
library(tidyverse)
TX <- data.frame(clade_names = c("d_Bacteria|g_Thermobaculum", "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Candidatus Koribacter", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Granulicella", "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Terriglobus"))
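TX2 <- TX %>%
  # "|" is a regex metacharacter, so it has to be escaped as "\\|"
  mutate(splits = str_split(clade_names, "\\|")) %>%
  # names_sep gives the unnamed pieces automatic column names (splits_1, splits_2, ...)
  unnest_wider(splits, names_sep = "_") %>%
  pivot_longer(cols = -clade_names) %>%
  # the two-character prefix ("d_", "p_", ...) becomes the new column name
  mutate(name = str_sub(value, 1, 2)) %>%
  filter(!is.na(name)) %>%
  pivot_wider()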
which gives:
# A tibble: 6 x 7
clade_names d_ g_ p_ c_ f_ o_
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 d_Bacteria|g_Thermobaculum d_Bacte~ g_Thermobac~ NA NA NA NA
2 d_Bacteria|p_Acidobacteria|c_Acidobacteria subdiv~ d_Bacte~ g_Luteitalea p_Acidob~ c_Acidobacte~ f_Vicinami~ NA
3 d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Aci~ d_Bacte~ g_Acidobact~ p_Acidob~ c_Acidobacte~ f_Acidobac~ o_Acidoba~
4 d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Aci~ d_Bacte~ g_Candidatu~ p_Acidob~ c_Acidobacte~ f_Acidobac~ o_Acidoba~
5 d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Aci~ d_Bacte~ g_Granulice~ p_Acidob~ c_Acidobacte~ f_Acidobac~ o_Acidoba~
6 d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Aci~ d_Bacte~ g_Terriglob~ p_Acidob~ c_Acidobacte~ f_Acidobac~ o_Acidoba~
You can of course adapt this code further to give the columns more meaningful names, or to remove the X_ prefix from clade_names, etc.
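As one possible way to make those adjustments, here is a minimal sketch that maps the prefixes to full rank names and drops the X_ part from the values. The `ranks` lookup vector, the `id`, `prefix` and `taxon` column names, and the `TX3` name are assumptions made for illustration; only the first three strings are used to keep it short.

```r
library(dplyr)
library(tidyr)

TX <- data.frame(clade_names = c(
  "d_Bacteria|g_Thermobaculum",
  "d_Bacteria|p_Acidobacteria|c_Acidobacteria subdivision|f_Vicinamibacteraceae|g_Luteitalea",
  "d_Bacteria|p_Acidobacteria|c_Acidobacteriia|o_Acidobacteriales|f_Acidobacteriaceae|g_Acidobacterium"))

# Hypothetical lookup: prefix letter -> rank name, in taxonomic order
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

TX3 <- TX %>%
  mutate(id = row_number()) %>%                        # remember the original row
  separate_rows(clade_names, sep = "\\|") %>%          # one row per "X_name" field
  separate(clade_names, into = c("prefix", "taxon"),
           sep = "_", extra = "merge") %>%             # split at the first "_" only
  mutate(rank = unname(ranks[prefix])) %>%             # "d" -> "domain", etc.
  select(-prefix) %>%
  pivot_wider(names_from = rank, values_from = taxon) %>%
  select(id, any_of(unname(ranks)))                    # taxonomic column order

TX3
```

Here `any_of()` keeps the columns in the order given by `ranks` and silently skips ranks that never occur in the data.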