R - removing substring in column of strings based on pattern and condition

Question

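I have a column of strings in a data frame, and I want to replace each value so that it contains only the substring before the first " (", i.e. before the first space/open-bracket pair. Not all strings contain brackets, and I want those left as they are.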

Example data:

col1 <- c(1, 2, 3, 4)
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")
df <- data.frame(col1, col2)
df

Output:

  col1         col2
1    1 a b (ABC DE)
2    2          bcd
3    3   cd ef (CE)
4    4          bcd

The output I'm looking for is this:

col1 <- c(1, 2, 3, 4)
col2 <- c("a b", "bcd", "cd ef", "bcd")
df <- data.frame(col1, col2)
df

Output:

  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd

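The actual data frame has over 40,000 rows and the strings take many possible values, so this can't be done manually as in the example. I'm not at all confident with regex/patterns, but I accept that this is probably the most straightforward approach.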

Answer 1

Here is a dplyr approach:

library(dplyr)
library(stringr)

df %>% 
  mutate(col2 = str_replace_all(col2, "\\(.+?\\)", ""))

which returns df (note the trailing space left where each bracketed group was removed):

  col1   col2
1    1   a b 
2    2    bcd
3    3 cd ef 
4    4    bcd
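
Because the pattern above matches only the brackets and their contents, the space in front of them survives. A minimal sketch of one way to drop it, assuming the leftover whitespace is unwanted and applying stringr's str_trim to the example vector directly:

```r
library(stringr)

col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")

# Remove each bracketed group, then trim the whitespace left behind.
cleaned <- str_trim(str_replace_all(col2, "\\(.+?\\)", ""))
cleaned
```

The same call can be placed inside mutate() in place of the bare str_replace_all().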

Answer 2

I would rather use a regular expression than a substring approach.

transform(df, col2 = gsub('\\s+\\(.*', '', col2))
#   col1  col2
# 1    1   a b
# 2    2   bcd
# 3    3 cd ef
# 4    4   bcd
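
For readers new to regex, here is a rough breakdown of that pattern, run on a plain vector. The last element is a hypothetical extra case (not in the original data) with two bracketed groups, included to show that everything from the first " (" onward is dropped:

```r
x <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "x (1) y (2)")

# \\s+  one or more whitespace characters
# \\(   a literal opening bracket (escaped, since "(" normally starts a group)
# .*    everything after it, up to the end of the string
result <- sub("\\s+\\(.*", "", x)
result
```

Strings without a match, such as "bcd", are returned unchanged. Since `.*` consumes the rest of the string there can be at most one match per element, so sub() and gsub() should behave the same here.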

Answer 3

A possible solution, based on stringr:

library(tidyverse)

df %>% 
  mutate(col2 = str_remove_all(col2, "\\s*\\(.*\\)\\s*"))

#>   col1  col2
#> 1    1   a b
#> 2    2   bcd
#> 3    3 cd ef
#> 4    4   bcd
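
On the example data the lazy `.+?` from Answer 1 and the greedy `.*` used here give the same text apart from whitespace, but they can differ when a string holds more than one bracketed group. A small sketch using base R gsub with perl = TRUE, on a hypothetical input that is not part of the question's data:

```r
# Hypothetical input with two bracketed groups (not in the original data).
x <- "x (1) y (2)"

# Lazy ".+?" stops at the first ")", so each group is removed separately
# and the text between the groups is kept.
lazy <- gsub("\\(.+?\\)", "", x, perl = TRUE)

# Greedy ".*" runs to the last ")", so the text between groups goes too.
greedy <- gsub("\\s*\\(.*\\)\\s*", "", x, perl = TRUE)

lazy
greedy
```

Since the question asks for only the part before the first " (", the greedy form is probably closer to what is wanted in that situation.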

Answer 4

Using base R gsub:

> df$col2 <- gsub("\\s*\\(.*\\)", "", df$col2)
> df
  col1  col2
1    1   a b
2    2   bcd
3    3 cd ef
4    4   bcd
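
Since the question mentions not being confident with regex, here is a sketch of a base R alternative that avoids it by searching for the literal " (" with fixed = TRUE. This is an illustration rather than one of the posted answers, and the variable names are arbitrary:

```r
col2 <- c("a b (ABC DE)", "bcd", "cd ef (CE)", "bcd")

# Position of the first literal " (" in each string; -1 where it is absent.
# as.integer() drops the extra attributes that regexpr() attaches.
pos <- as.integer(regexpr(" (", col2, fixed = TRUE))

# Keep everything before that position, or the whole string if absent.
trimmed <- ifelse(pos > 0, substr(col2, 1, pos - 1), col2)
trimmed
```

Assigning the result back with df$col2 <- trimmed would update the column in the same way as the gsub call.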
