R conditional logic to replace a character in a string, based on the preceding and following characters in the string

Question

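I have a vector of strings in which I replaced the spaces with underscores. I intend to convert them back to spaces, but the original data contained some grammatical errors, which means that some of the spaces should not actually be spaces. I have some simple conditional logic that describes when an underscore should be replaced with a space, when it should be replaced with a dash (-), and when it should be removed altogether.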

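The strings are names of chemical compounds. If an underscore comes directly before or after a digit, it should be replaced with a dash ("-"). If an underscore has a letter on both sides, it should be replaced with a space (" "). If an underscore comes directly before or after a dash, it should be removed without replacement. More than one of these scenarios may apply at different positions within a given string. A further complication is that if a digit directly follows or precedes a letter, there should be a dash between them.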

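Here is a minimal dataset that demonstrates all of these scenarios, along with the desired result. Note that the actual dataset has over 35,000 entries (though only 670 unique ones).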

names
[1] "1,8_cineole" "geranyl_acetate" "AR_curcumene" "trans_trans_a-farnesene" "trans_muurola_4,5_diene"
[6] "p_cymene" "a_-_pinene" "cadina_3,5_diene" "germacrene_D" "trans_cadina1,4diene"

converted_names
[1] "1,8-cineole" "geranyl acetate" "AR curcumene" "trans trans a-farnesene" "trans muurola-4,5-diene"
[6] "p cymene" "a-pinene" "cadina-3,5-diene" "germacrene D" "trans cadina-1,4-diene"

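I was thinking of tackling this with a nested loop that iterates over the list of names, then splits each name into its individual characters and iterates over those, but I am a bit lost on the conditional logic needed to replace individual characters in the string.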

convert_compound_names <- function(x) {
  # positions of underscores and digits in each name (not yet used below)
  underscore_locations <- lapply(strsplit(x, ""), function(chars) which(chars == "_"))
  digit_locations <- lapply(strsplit(x, ""), function(chars) grep("\\d", chars))
  for (i in seq_along(x)) {
    split_name <- unlist(strsplit(x[i], ""))
    for (j in seq_along(split_name)) {
      # some conditional logic to replace underscores here
    }
    x[i] <- paste0(split_name, collapse = "")
  }
  return(x)
}

I was also wondering whether the conditional logic could be incorporated into a gsub call, so that the loops might not be needed at all?

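For the record, I am a chemist, not a programmer or a data scientist, so any comments, suggestions, or moral support would be greatly appreciated.

Thanks for reading.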

Answer 1

chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

This sounds like a regex problem, which I'm still fairly new to, but I think the code below does what you need.

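Here I first replace every "_#" (an underscore followed by a digit) with "-" using a lookahead of the form (?=\d): it checks that a digit [\d] follows the underscore without making that digit part of the match, so only the underscore is replaced with "-". I then do the same for an underscore that comes right after a digit, using a lookbehind (?<=\d), and finally turn all remaining underscores into spaces.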

library(dplyr); library(stringr)
data.frame(chem_names) %>%
  mutate(chem_names2 = chem_names %>%
           str_replace_all("_(?=\\d)", "-") %>%   # replace _# with -
           str_replace_all("(?<=\\d)_", "-") %>%  # replace #_ with -
           str_replace_all("_", " "))             # replace _ with space

Result

                chem_names             chem_names2
1              1,8_cineole             1,8-cineole
2          geranyl_acetate         geranyl acetate
3             AR_curcumene            AR curcumene
4  trans_trans_a-farnesene trans trans a-farnesene
5  trans_muurola_4,5_diene trans muurola-4,5-diene
6                 p_cymene                p cymene
7               a_-_pinene              a - pinene
8         cadina_3,5_diene        cadina-3,5-diene
9             germacrene_D            germacrene D
10    trans_cadina1,4diene    trans cadina1,4diene
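
Rows 7 and 10 in the result above still differ from the desired output in the question ("a-pinene" and "trans cadina-1,4-diene"). The following is a minimal base R sketch, not a definitive implementation, that chains gsub calls to cover the two remaining rules as well. The function name fix_compound_names is hypothetical, and the sketch assumes that every letter/digit boundary really should receive a dash, which would be wrong for names such as formulas like "C2H4".

```r
chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

# Desired output, copied from converted_names in the question
expected <- c("1,8-cineole", "geranyl acetate", "AR curcumene", "trans trans a-farnesene",
              "trans muurola-4,5-diene", "p cymene", "a-pinene", "cadina-3,5-diene",
              "germacrene D", "trans cadina-1,4-diene")

# Hypothetical helper; the order of the steps matters
fix_compound_names <- function(x) {
  # 1. drop underscores that touch a dash ("a_-_pinene" becomes "a-pinene")
  x <- gsub("_*-_*", "-", x)
  # 2. underscore directly before or after a digit becomes a dash
  x <- gsub("_(?=[0-9])|(?<=[0-9])_", "-", x, perl = TRUE)
  # 3. insert a dash where a letter and a digit touch (assumption: always wanted)
  x <- gsub("([A-Za-z])([0-9])", "\\1-\\2", x)
  x <- gsub("([0-9])([A-Za-z])", "\\1-\\2", x)
  # 4. every underscore that is left becomes a space
  gsub("_", " ", x, fixed = TRUE)
}

converted <- fix_compound_names(chem_names)
identical(converted, expected)
```

Because gsub is vectorized, the same call works on the full column of names without an explicit loop.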

Answer 2

I worked out the conditional logic needed to handle the underscore replacement inside the loop I proposed above:

convert_compound_names <- function(x) {
  for (i in seq_along(x)) {
    split_name <- unlist(strsplit(x[i], ""))
    for (j in seq_along(split_name)) {
      # conditional logic to replace underscores
      # (assumes no name starts with an underscore: split_name[j - 1] is empty when j is 1)
      if (split_name[j] == "_") {
        if (grepl("\\d", split_name[j - 1]) || grepl("\\d", split_name[j + 1])) {
          split_name[j] <- "-"   # next to a digit: dash
        } else if (grepl("-", split_name[j - 1]) || grepl("-", split_name[j + 1])) {
          split_name[j] <- ""    # next to a dash: remove
        } else if (grepl("[a-zA-Z]", split_name[j - 1]) && grepl("[a-zA-Z]", split_name[j + 1])) {
          split_name[j] <- " "   # between two letters: space
        }
      }
    }
    x[i] <- paste0(split_name, collapse = "")
  }
  return(x)
}

That said, I'm sure a more direct way can be found.
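
One possibly more direct variant of the same idea, given here only as a sketch, builds "previous character" and "next character" vectors once per name and then applies the three rules with logical indexing instead of an inner loop. The helper name convert_one is hypothetical. Unlike the loop above, it pads both ends with an empty string so a leading or trailing underscore does not cause an error, and it turns every leftover underscore into a space rather than only those between two letters. Like the loop, it does not insert dashes between a letter and a digit that touch directly.

```r
chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

# Hypothetical helper: converts a single name
convert_one <- function(name) {
  chars <- strsplit(name, "")[[1]]
  prev <- c("", head(chars, -1))   # character before each position ("" at the start)
  nxt  <- c(tail(chars, -1), "")   # character after each position ("" at the end)

  is_us      <- chars == "_"
  near_digit <- grepl("[0-9]", prev) | grepl("[0-9]", nxt)
  near_dash  <- prev == "-" | nxt == "-"

  # same priority as the loop: digit rule first, then dash rule, then space
  chars[is_us & near_digit] <- "-"
  chars[is_us & !near_digit & near_dash] <- ""
  chars[is_us & !near_digit & !near_dash] <- " "
  paste(chars, collapse = "")
}

converted <- vapply(chem_names, convert_one, character(1), USE.NAMES = FALSE)
converted
```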

Answer 3

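I think the regex for this is actually fairly simple. We use ifelse to check a condition; the condition is whether str_detect finds a digit \d anywhere in the string. If it does, every _ is replaced with -. If not, every _ is replaced with a space: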

library(dplyr)
library(stringr)
data.frame(chem_names) %>%
  mutate(chem_names = ifelse(str_detect(chem_names, "\\d"),
                             gsub("_", "-", chem_names),
                             gsub("_", " ", chem_names)))
                chem_names
1              1,8-cineole
2          geranyl acetate
3             AR curcumene
4  trans trans a-farnesene
5  trans-muurola-4,5-diene
6                 p cymene
7               a - pinene
8         cadina-3,5-diene
9             germacrene D
10    trans-cadina1,4diene
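
Because the ifelse condition is evaluated once per string rather than once per underscore, names that mix both cases are not fully covered. As a rough check, the base R sketch below reproduces the same whole-string rule with grepl and gsub and lists the entries that differ from the desired output. The vector expected is copied from converted_names in the question; the variable names are otherwise hypothetical.

```r
chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

# Desired output, copied from converted_names in the question
expected <- c("1,8-cineole", "geranyl acetate", "AR curcumene", "trans trans a-farnesene",
              "trans muurola-4,5-diene", "p cymene", "a-pinene", "cadina-3,5-diene",
              "germacrene D", "trans cadina-1,4-diene")

# Whole-string rule: if the name contains any digit, all underscores become dashes,
# otherwise all underscores become spaces
whole_string <- ifelse(grepl("[0-9]", chem_names),
                       gsub("_", "-", chem_names, fixed = TRUE),
                       gsub("_", " ", chem_names, fixed = TRUE))

# Which entries differ from the desired result?
mismatch <- which(whole_string != expected)
data.frame(input  = chem_names[mismatch],
           got    = whole_string[mismatch],
           wanted = expected[mismatch])
```

On this sample the mismatches are entries 5, 7 and 10, the names where an underscore between letters, an underscore next to a dash, or a direct letter/digit boundary needs separate handling.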

Data:

chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

string

r

conditional-statements