Create a Column Based on Another Column Using grepl

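Consider a data frame df with two columns, word and stem. I want to create a new column that checks whether the value in stem is contained in word, and whether it is preceded or followed by more characters. The end result should look like this: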

WORD     STEM     NEW
rerun    run      prefixed
runner   run      suffixed
run      run      none
...      ...      ...

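You can see my code so far below. However, it does not work, because the grepl expression is applied to all rows of df at once instead of matching each row's stem against that row's word. Anyway, I think it should make my idea clear.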

df$new <- ifelse(grepl(paste0('.+', df$stem, '.+'), df$word), 'both',
             ifelse(grepl(paste0(df$stem, '.+'), df$word), 'suffixed',
                ifelse(grepl(paste0('.+', df$stem), df$word), 'prefixed','none')))

You can use mapply to apply grepl row by row, for example:

ifelse(mapply(grepl, paste0(".+", x$STEM, ".+"), x$WORD), "both",
ifelse(mapply(grepl, paste0(x$STEM, ".+"), x$WORD), "suffixed",
ifelse(mapply(grepl, paste0(".+", x$STEM), x$WORD), "prefixed", "none")))
#"prefixed" "suffixed"     "none" 
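
To make the snippet above reproducible, here is a minimal sketch under some assumptions: the sample data frame x mirrors the table in the question, and row_grepl is a hypothetical helper name (not part of base R) that wraps mapply so each pattern is matched only against the word in the same row.

```r
# Sample data mirroring the table in the question
# (assumption: plain character columns, not factors).
x <- data.frame(WORD = c("rerun", "runner", "run"),
                STEM = c("run", "run", "run"),
                stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where a row's pattern matches that same row's text.
# USE.NAMES = FALSE keeps the result an unnamed logical vector.
row_grepl <- function(pattern, text) {
  mapply(grepl, pattern, text, USE.NAMES = FALSE)
}

# Check the most specific pattern first, then fall through to the simpler ones.
x$NEW <- ifelse(row_grepl(paste0(".+", x$STEM, ".+"), x$WORD), "both",
         ifelse(row_grepl(paste0(x$STEM, ".+"), x$WORD), "suffixed",
         ifelse(row_grepl(paste0(".+", x$STEM), x$WORD), "prefixed", "none")))

print(x)
```

Note that the stems are interpreted as regular expressions here, so a stem containing metacharacters such as . or ( would need escaping first.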

Or use startsWith and endsWith to compute an index into a vector of labels:

c("none", "both", "suffixed", "prefixed")[1 + (1 + startsWith(x$WORD, x$STEM) +
 2*endsWith(x$WORD, x$STEM)) * (nchar(x$WORD) > nchar(x$STEM) &
 mapply(grepl, x$STEM, x$WORD))]
#[1] "prefixed" "suffixed" "none"    

You can create new like this:

df$new <- ifelse(startsWith(df$word, df$stem) & endsWith(df$word, df$stem), 'none',
                 ifelse(startsWith(df$word, df$stem), 'suffixed',
                        ifelse(endsWith(df$word, df$stem), 'prefixed',
                               'both')))

Or, if you are in a dplyr pipeline and want to avoid all the annoying df$ prefixes:

library(dplyr)

df %>%
  mutate(new = ifelse(startsWith(word, stem) & endsWith(word, stem), 'none',
                      ifelse(startsWith(word, stem), 'suffixed',
                             ifelse(endsWith(word, stem), 'prefixed',
                                    'both'))))

Output:

#       word stem      new
# 1    rerun  run prefixed
# 2   runner  run suffixed
# 3      run  run     none
# 4    aruna  run     both
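
One thing to watch with the startsWith/endsWith chain: a word that does not contain the stem at all also falls through to 'both'. The following is a minimal sketch of one way to guard against that, assuming character inputs. classify_affix is a hypothetical helper name, and returning NA for non-matching rows is my own choice, not something the question specifies.

```r
# Hypothetical helper: label how `stem` sits inside `word`, row by row.
classify_affix <- function(word, stem) {
  # Literal (non-regex) containment check, one stem per word.
  contains <- mapply(grepl, stem, word,
                     MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
  starts <- startsWith(word, stem)
  ends   <- endsWith(word, stem)

  ifelse(!contains, NA_character_,          # stem not found at all
  ifelse(starts & ends, "none",
  ifelse(starts, "suffixed",
  ifelse(ends, "prefixed", "both"))))
}

words <- c("rerun", "runner", "run", "aruna", "walk")
stems <- rep("run", length(words))
print(classify_affix(words, stems))
```

Because fixed = TRUE is passed to grepl, stems are matched literally rather than as regular expressions.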

Here is an approach using str_locate from stringr together with dplyr:

library(dplyr)
library(stringr)
data %>%
  mutate_at(vars(WORD,STEM), as.character) %>%
  mutate(NEW = 
         case_when(str_locate(WORD,STEM)[,"start"] > 1 &
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "both",
                   str_locate(WORD,STEM)[,"start"] > 1 ~ "prefixed",
                   str_locate(WORD,STEM)[,"end"] < nchar(WORD) ~ "suffixed",
                   TRUE ~ "none"))
    WORD STEM      NEW
1  rerun  run prefixed
2 runner  run suffixed
3    run  run     none

I added a line to convert WORD and STEM to character, in case they are factors.
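
For reference, a self-contained variant of the same idea might look like the sketch below. It assumes dplyr (1.0 or later, for across) and stringr are installed, builds a small sample data frame matching the question, and stores the match positions in two temporary columns (loc_start and loc_end, names chosen here for illustration) so that str_locate is not called repeatedly inside case_when.

```r
library(dplyr)
library(stringr)

# Sample data matching the table in the question.
data <- data.frame(WORD = c("rerun", "runner", "run"),
                   STEM = c("run", "run", "run"))

result <- data %>%
  # Convert to character in case the columns are factors.
  mutate(across(c(WORD, STEM), as.character)) %>%
  # Position of the first match of each row's STEM in that row's WORD.
  mutate(loc_start = str_locate(WORD, STEM)[, "start"],
         loc_end   = str_locate(WORD, STEM)[, "end"],
         NEW = case_when(loc_start > 1 & loc_end < nchar(WORD) ~ "both",
                         loc_start > 1 ~ "prefixed",
                         loc_end < nchar(WORD) ~ "suffixed",
                         TRUE ~ "none")) %>%
  select(-loc_start, -loc_end)

print(result)
```

If a word does not contain its stem, str_locate returns NA for both positions, so every condition is NA and the row falls through to "none"; add an explicit is.na(loc_start) branch if those rows should be labelled differently.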