Replace second occurrence of a string in one column based on value in other column in R
Here is a sample data frame:
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse")
df <- data.frame(a,b)
I would like to remove from column b the second occurrence of the value in column a.
Here is my desired output:
a      b
cat    my cat is a tabby and is a friendly cat
dog    walk the dog
mouse  the mouse is scared of the other
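I've tried different combinations of gsub and some stringr functions, but I haven't even come close to removing the second (and only the second) occurrence of the string in col a from col b. I think I'm asking something similar to this question, but I'm not familiar with Perl and couldn't translate it to R.
Thanks!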
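It takes a bit of work to build the right regular expression.
# Alternation of all keywords: "cat|dog|mouse"
P1 = paste(a, collapse="|")
# Group 1: the first keyword plus the shortest text before its repeat; \\2 matches the repeat
PAT = paste0("((", P1, ").*?)(\\2)")
sub(PAT, "\\1", b, perl=TRUE)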
[1] "my cat is a tabby  and is a friendly cat"
[2] "walk the dog"
[3] "the mouse is scared of the other "
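As a rough illustration of how this pattern behaves, the sketch below reruns it on the sample vectors and then tidies the whitespace that the substitution leaves behind (a double space in the first string, a trailing space in the third). The names `res` and `res_clean` and the cleanup step are additions for illustration, not part of the answer above.

```r
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat",
       "walk the dog",
       "the mouse is scared of the other mouse")

# Alternation of all keywords: "cat|dog|mouse"
P1 <- paste(a, collapse = "|")

# Group 2 captures whichever keyword appears first; group 1 is that keyword
# plus the shortest stretch of text before the keyword repeats (\\2).
PAT <- paste0("((", P1, ").*?)(\\2)")

# Keep group 1 only, which drops the repeated keyword
res <- sub(PAT, "\\1", b, perl = TRUE)

# Assumed extra step: collapse runs of whitespace and trim the ends
res_clean <- trimws(gsub("\\s+", " ", res))
print(res_clean)
```

Because the alternation lets any keyword match in any row, this relies on each row of b repeating only its own keyword, which holds for the sample data but is an assumption in general.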
You could do this...
library(stringr)
# One pattern per row: text containing df$a, then a space and df$a again
df$b <- str_replace(df$b,
                    paste0("(.*?", df$a, ".*?) ", df$a),
                    "\\1")
df
      a                                       b
1   cat my cat is a tabby and is a friendly cat
2   dog                            walk the dog
3 mouse        the mouse is scared of the other
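The regex looks for the first run of characters that contains df$a somewhere in it, followed by a space and another df$a. The capture group, denoted by (...), is the text up to the space before the second occurrence, and the whole match (including the second occurrence) is replaced by the capture group \\1. This has the effect of deleting the second df$a and the space before it. Anything after the second df$a is unaffected.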
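To make the per-row pattern concrete, here is a minimal base R sketch of the same idea, assuming the sample data from the question. `pats` and `out` are hypothetical names, and `mapply` is used only because `sub` does not vectorize over its pattern argument the way `str_replace` does.

```r
df <- data.frame(a = c("cat", "dog", "mouse"),
                 b = c("my cat is a tabby cat and is a friendly cat",
                       "walk the dog",
                       "the mouse is scared of the other mouse"),
                 stringsAsFactors = FALSE)

# paste0 is vectorized, so each row gets its own pattern,
# e.g. "(.*?cat.*?) cat" for the first row
pats <- paste0("(.*?", df$a, ".*?) ", df$a)

# Apply each pattern to its own row; \\1 keeps everything before
# the space that precedes the second occurrence
out <- unname(mapply(function(p, s) sub(p, "\\1", s, perl = TRUE),
                     pats, df$b))
print(out)
```

Rows with fewer than two occurrences, such as "walk the dog", do not match the pattern and are returned unchanged.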
I actually ended up finding another solution that, although longer, may be clearer for other regex beginners:
library(stringr)
# Replace first instance of col a in col b with "INTERIM"
df$b <- str_replace(df$b, df$a, "INTERIM")
# Now that the original first instance of col a is re-labeled to "INTERIM", I can again replace the first instance of col a in col b, this time with an empty string
df$b <- str_replace(df$b, df$a, "")
# And I can re-replace the re-labeled "INTERIM" to the original string in col a
df$b <- str_replace(df$b, "INTERIM", df$a)
# Trim "double" whitespace
df$b <- str_replace(gsub("\\s+", " ", str_trim(df$b)), "B", "b")
df
a      b
cat    my cat is a tabby and is a friendly cat
dog    walk the dog
mouse  the mouse is scared of the other
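The same "skip the first, remove the second" idea can also be done by position instead of with a placeholder. The following is a minimal sketch, assuming the values in col a should be matched literally rather than as regexes; `remove_second` is a hypothetical helper name, not something from the answers here.

```r
df <- data.frame(a = c("cat", "dog", "mouse"),
                 b = c("my cat is a tabby cat and is a friendly cat",
                       "walk the dog",
                       "the mouse is scared of the other mouse"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: drop the second literal occurrence of key in text
remove_second <- function(text, key) {
  # Start positions of every literal occurrence (-1 if there is none)
  hits <- gregexpr(key, text, fixed = TRUE)[[1]]
  # Fewer than two occurrences: nothing to remove
  if (length(hits) < 2) return(text)
  start <- hits[2]
  end <- start + attr(hits, "match.length")[2] - 1
  # Glue together the text before and after the second occurrence
  out <- paste0(substr(text, 1, start - 1),
                substr(text, end + 1, nchar(text)))
  # Collapse the whitespace left behind and trim the ends
  trimws(gsub("\\s+", " ", out))
}

df$b <- unname(mapply(remove_second, df$b, df$a))
print(df)
```

Unlike the placeholder approach, this does not depend on "INTERIM" being absent from the text, though it does more bookkeeping by hand.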
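Base R, split-apply-combine solution. Note that unique() drops every repeated word in b, not only the second occurrence of a, so for the first row this gives "my cat is a tabby and friendly" rather than the desired output: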
# Split-apply-combine:
data.frame(do.call("rbind", lapply(split(df, df$a), function(x){
        # Split b into words, keep the first occurrence of each word, and re-join
        b <- paste(unique(unlist(strsplit(x$b, "\\s+"))), collapse = " ")
        return(data.frame(a = x$a, b = b))
      }
    )
  ),
  stringsAsFactors = FALSE, row.names = NULL
)
Data:
df <- data.frame(a = c("cat", "dog", "mouse"),
b = c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse"),
stringsAsFactors = FALSE)