How to perform a separate() followed by a mutate_each() with dplyr
I have data in a SQLite database that contains an entity that is not in first normal form. The strings in the 'sample_attribute' column look like this:
isolate: R4166 || age: 43.88 || biomaterial_provider: LIBD || sex: male || tissue: DLPFC || disease: control || race: AA || RIN: 8.7 || Fraction: total || BioSampleModel: Human
My code so far:
library(tidyr)
library(dplyr)
library(stringi)
rs.df <- structure(list(run_accession = c("SRR1554537", "SRR2071348"),
platform_parameters = c("INSTRUMENT_MODEL: Illumina HiSeq 2000",
"INSTRUMENT_MODEL: Illumina HiSeq 2000"), sample_attribute = c("isolate: R3452 || age: -0.3836 || biomaterial_provider: LIBD || sex: female || tissue: DLPFC || disease: control || race: AA || RIN: 9.6 || Fraction: total || BioSampleModel: Human", "isolate: R3452 || age: -0.3836 || biomaterial_provider: LIBD || sex: female || tissue: DLPFC || disease: control || race: AA || RIN: 9.6 || Fraction: total || BioSampleModel: Human")), .Names = c("run_accession", "platform_parameters", "sample_attribute"
), row.names = c(NA, -2L), class = "data.frame")
coln <- c("isolate", "age", "biomaterial_provider", "sex", "tissue", "disease", "race",
"RIN", "Fraction", "BioSampleModel")
rs.df <- rs.df %>%
  separate(sample_attribute, coln, sep = "\\|\\|")
head(rs.df, 1)
Intermediate result:
sample_attribute
  run_accession                   platform_parameters        isolate        age
1    SRR1554534 INSTRUMENT_MODEL: Illumina HiSeq 2000 isolate: DLPFC age: 40.42
        biomaterial_provider       sex        tissue          disease
1 biomaterial_provider: LIBD sex: male tissue: DLPFC disease: Control
      race      RIN        Fraction        BioSampleModel
1 race: AA RIN: 8.4 Fraction: total BioSampleModel: Human
At the moment I continue with:
for (x in coln){
  rs.df[, x] <- stri_replace(rs.df[, x], regex = "^.+:\\s*", replacement = "")
}
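But this is inflexible.
Is there a way to extend the dplyr pipeline so that calls inside the %>% chain replace the for loop (as far as possible)?
At a minimum, for the columns named in coln, I want to remove the string before the colon from the result of the separate() call: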
rs.df <- rs.df %>%
  separate(sample_attribute, coln, sep = "\\|\\|") %>%
mutate_each(... stri_replace...) #split pairs at ":", remove part before ":"
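(The for loop above solves my problem of separating and cleaning up the strings. However, the SRAdb database may contain more columns like this one, with key: value pairs separated by "||". How can I handle them in a more flexible way?)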
See @docendo discimus's answer here: dplyr certain columns
In your case:
rs.df <- rs.df %>%
  separate(sample_attribute, coln, sep = "\\|\\|") %>%
  mutate_each_(funs(stri_replace(., regex = "^.+:\\s*", replacement = "")), coln)
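Note that mutate_each_() and funs() were deprecated in later dplyr releases. The following is a minimal sketch of the same idea using mutate() with across(), assuming dplyr 1.0.0 or later is installed. The small data frame is a shortened, hypothetical stand-in for the question's rs.df, and the extra stri_trim_both() call is my own addition to drop the trailing space that splitting on "||" leaves behind.

```r
library(tidyr)
library(dplyr)
library(stringi)

# Shortened stand-in for the question's rs.df (hypothetical values,
# only three key: value pairs per row).
rs.df <- data.frame(
  run_accession = c("SRR1554537", "SRR2071348"),
  sample_attribute = c(
    "isolate: R3452 || age: -0.3836 || sex: female",
    "isolate: R4166 || age: 43.88 || sex: male"
  ),
  stringsAsFactors = FALSE
)
coln <- c("isolate", "age", "sex")

cleaned <- rs.df %>%
  # Split the packed column on the literal "||" separator.
  separate(sample_attribute, coln, sep = "\\|\\|") %>%
  # For every column named in coln: drop everything up to the colon,
  # then trim the whitespace left over from the split.
  mutate(across(
    all_of(coln),
    ~ stri_trim_both(stri_replace_first_regex(.x, "^.+:\\s*", ""))
  ))

print(cleaned)
```

Because all_of(coln) selects columns by name from a character vector, the same pipeline can be reused for another "||"-separated column by swapping in a different vector of names.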