Replace part of a character string in a data frame based on conditions in R
I have a data frame like this:
df = read.table(text="REF Alt S00001 S00002 S00003 S00004 S00005
TAAGAAG TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAG TAAG/TAAG TAAGAAG/TAAGAAG TAAGAAG/TAAGAAG
T TG T/T -/- TG/TG T/T T/T
CAAAA CAAA CAAAA/CAAAA CAAAA/CAAA CAAAA/CAAAA -/- CAAAA/CAAAA
TTGT TTGTGT TTGT/TTGT TTGT/TTGT TTGT/TTGT TTGTGT/TTGTGT TTGT/TTGTGT
GTTT GTTTTT GTTT/GTTTTT GTTT/GTTT GTTT/GTTT GTTT/GTTT GTTTTT/GTTTTT", header=T, stringsAsFactors=F)
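I want to replace the character elements separated by "/" with "D" or "I", depending on the lengths of the strings in the "REF" and "Alt" columns. If an element matches the longer of the two, it should be replaced with "I"; otherwise it should be replaced with "D". "-" should be left unchanged. So the expected result is: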
REF Alt S00001 S00002 S00003 S00004 S00005
TAAGAAG TAAG I/I I/D D/D I/I I/I
T TG D/D -/- I/I D/D D/D
CAAAA CAAA I/I I/D I/I -/- I/I
TTGT TTGTGT D/D D/D D/D I/I D/I
GTTT GTTTTT D/I D/D D/D D/D I/I
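You can build a lookup map from every combination of the REF and Alt strings to the corresponding combination of I and D. In each row the longer of REF and Alt is labelled I and the shorter one D, as the question asks:
# TRUE where REF is the longer string of the pair
ref_longer <- nchar(df$REF) > nchar(df$Alt)
# Note: the map is shared across rows, so this assumes no string is the
# longer one in one row and the shorter one in another
refalt <- data.frame(
  from=c(df$REF, df$Alt),
  to=c(ifelse(ref_longer, 'I', 'D'), ifelse(ref_longer, 'D', 'I')),
  stringsAsFactors=FALSE)
refalt <- rbind(refalt, c('-', '-'))  # '-' maps to itself
from <- expand.grid(refalt$from, refalt$from)
to <- expand.grid(refalt$to, refalt$to)
map <- paste(to[,1], to[,2], sep='/')
names(map) <- paste(from[,1], from[,2], sep='/')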
Then you can apply the map to each column:
for (name in paste0('S0000', seq(5))) {
  # Look up each element in the map; unname() drops the lookup names
  df[[name]] <- unname(map[df[[name]]])
}
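To make the lookup idea concrete, here is a minimal sketch on a single toy pair. The strings and the names `alleles`, `labels`, `toy_map`, `genotypes` and `recoded` are illustrative assumptions, not part of the original answer:

```r
# Toy single-pair example (illustrative values, not the question's full data)
alleles <- c('CAAAA', 'CAAA', '-')  # longer string, shorter string, missing
labels  <- c('I', 'D', '-')         # label for each entry of `alleles`

# expand.grid enumerates both vectors in the same order, so rows line up
from <- expand.grid(alleles, alleles, stringsAsFactors = FALSE)
to   <- expand.grid(labels, labels, stringsAsFactors = FALSE)

# Named character vector: names are the original pairs, values the recoded pairs
toy_map <- setNames(paste(to[, 1], to[, 2], sep = '/'),
                    paste(from[, 1], from[, 2], sep = '/'))

genotypes <- c('CAAAA/CAAA', '-/-', 'CAAA/CAAA')
recoded <- unname(toy_map[genotypes])
print(recoded)  # "I/D" "-/-" "D/D"
```

One thing to keep in mind with this design: an element that does not appear among the names of the map comes back as NA rather than being left unchanged.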
Here is one approach. I used the stringi package because it handles a vector of patterns to search for together with a vector of strings nicely.
First, determine which string is shorter and which is longer:
short <- ifelse(nchar(df$Alt) > nchar(df$REF), df$REF, df$Alt)
long <- ifelse(nchar(df$REF) > nchar(df$Alt), df$REF, df$Alt)
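As a quick check of what these two expressions return, here is a small sketch on toy vectors. The vectors `REF` and `Alt` below are made-up stand-ins for `df$REF` and `df$Alt`, and the third pair is an assumed equal-length case that does not occur in the question's data:

```r
# Toy stand-ins for df$REF and df$Alt; the third pair has equal lengths
REF <- c("TAAGAAG", "T", "AC")
Alt <- c("TAAG", "TG", "GT")

short <- ifelse(nchar(Alt) > nchar(REF), REF, Alt)
long  <- ifelse(nchar(REF) > nchar(Alt), REF, Alt)

print(short)  # "TAAG" "T" "GT"
print(long)   # "TAAGAAG" "TG" "GT"
```

With equal lengths, both expressions fall through to `Alt`, so `short` and `long` are the same string for that row. If such rows can occur in your data, they would need separate handling.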
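Use these and loop over your columns, specifying the replacement for each pattern. Replace the long pattern first, to avoid problems with strings that match both the short and the long pattern: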
library(stringi)
df[, !(names(df) %in% c("REF", "Alt"))] <-  # assign into original df
  lapply(1:(ncol(df) - 2),  # - 2 because there are two columns we don't use
         function(ii) stri_replace_all_fixed(df[, ii + 2], long, "I"))  # + 2 to skip first 2 columns
df[, !(names(df) %in% c("REF", "Alt"))] <-
  lapply(1:(ncol(df) - 2),
         function(ii) stri_replace_all_fixed(df[, ii + 2], short, "D"))
# REF Alt S00001 S00002 S00003 S00004 S00005
#1 TAAGAAG TAAG I/I I/D D/D I/I I/I
#2 T TG D/D -/- I/I D/D D/D
#3 CAAAA CAAA I/I I/D I/I -/- I/I
#4 TTGT TTGTGT D/D D/D D/D I/I D/I
#5 GTTT GTTTTT D/I D/D D/D D/D I/I
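Why the long pattern has to go first can be seen on a single element from row 4. This is a minimal sketch that uses base R's `gsub(..., fixed = TRUE)` as a stand-in for `stri_replace_all_fixed`, an assumption made only so the snippet runs without stringi; the names `short_first` and `long_first` are illustrative:

```r
x     <- "TTGT/TTGTGT"
short <- "TTGT"
long  <- "TTGTGT"

# Short pattern first: "TTGT" also matches the start of "TTGTGT",
# so the long string is partly consumed and "I" can no longer match
short_first <- gsub(long, "I", gsub(short, "D", x, fixed = TRUE), fixed = TRUE)

# Long pattern first: the long string is replaced whole, then the short one
long_first <- gsub(short, "D", gsub(long, "I", x, fixed = TRUE), fixed = TRUE)

print(short_first)  # "D/DGT"
print(long_first)   # "D/I"
```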