Create new column from factor column by multiple conditions

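I want to create a new column from an existing column that contains many factor levels, where parts of the level names repeat. Let me illustrate: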

factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
df <- data.frame(factorA)

Here is my attempt:

library(dplyr)
    df <- mutate(
           df,factorB = case_when(
           matches(factorA,"paul.") ~ "paul",
           matches(factorA,"george.") ~ "george",
           matches(factorA,"john.") ~ "john",
           matches(factorA,"ringo.") ~ "ringo",
           TRUE ~ "NA"))

This gives me Error in mutate_impl(.data, dots) : Evaluation error: is_string(match) is not TRUE. I think this happens because I have not correctly told R how to look for the string fragments I want.

The result should look like this:

           factorA  factorB
1    paul173643738  paul
2       paul827484  paul 
3   george39585496  george
4 george7848658946  george
5      john2354674  john
6          john346  john
7      ringo384934  ringo
8       ringo24653  ringo

I'm sure this has been asked before, but I couldn't find an answer that fits my needs. Any help would be appreciated.

Using stringr:

library(dplyr)    # mutate(), case_when(), %>%
library(stringr)  # str_detect()

df %>%
  mutate(factorB = case_when(
    str_detect(factorA, "paul.") ~ "paul",
    str_detect(factorA, "george.") ~ "george",
    str_detect(factorA, "john.") ~ "john",
    str_detect(factorA, "ringo.") ~ "ringo"
  ))
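As far as I can tell, the original error arises because dplyr's matches() is a column-selection helper that expects a single pattern string as its first argument, so passing the factorA column fails the is_string(match) check. What case_when() needs on the left of each `~` is a logical vector with one TRUE/FALSE per row, which str_detect() or base R's grepl() provides. A minimal base R sketch of the same idea, where `prefixes` is a hypothetical helper vector that is not part of the question:

```r
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA)

# Hypothetical helper: the known name prefixes to look for
prefixes <- c("paul", "george", "john", "ringo")

# Start with NA so that rows matching no prefix stay missing
df$factorB <- NA_character_
for (p in prefixes) {
  # grepl() returns one TRUE/FALSE per row, like str_detect()
  hit <- grepl(paste0("^", p), df$factorA)
  df$factorB[hit] <- p
}
```

The `^` anchor restricts the match to the start of the string, which is slightly stricter than the unanchored patterns used above.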

You can use stringr::str_detect:

library(tidyverse)
factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
df <- tibble(factorA)  # keeps the column name; the deprecated as_data_frame() named it `value`
df %>% 
  mutate(factorB = case_when(
    str_detect(factorA, "paul") ~ "paul",
    str_detect(factorA, "george") ~ "george",
    str_detect(factorA, "john") ~ "john",
    str_detect(factorA, "ringo") ~ "ringo"
  ))
#> # A tibble: 8 x 2
#>   factorA          factorB
#>   <chr>            <chr>  
#> 1 paul173643738    paul   
#> 2 paul827484       paul   
#> 3 george39585496   george 
#> 4 george7848658946 george 
#> 5 john2354674      john   
#> 6 john346          john   
#> 7 ringo384934      ringo  
#> 8 ringo24653       ringo
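When the set of names is known in advance, the four case_when() branches can probably be collapsed into a single call to stringr::str_extract() with an alternation pattern. This is only a sketch: `name_pattern` is a name introduced here for illustration, and it assumes each string contains at most one of the names.

```r
library(stringr)

factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")

# Hypothetical pattern: any one of the known names
name_pattern <- "paul|george|john|ringo"

# str_extract() returns the first match per element, or NA if nothing matches
factorB <- str_extract(factorA, name_pattern)
```

Strings that contain none of the names come back as NA, which mirrors what case_when() does when no branch matches.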

If the format of the strings in factorA is fixed, you can use gsub:

# Extract the names
only_names <- gsub('(^[A-Za-z]*).*', '\\1', factorA)
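To get from the extracted names to the column the question asks for, the vector still has to be attached to the data frame. The sketch below uses sub() with an anchored variant of the same pattern (my own variation, not the answer's exact call) and stores the result as a factor. Note the doubled backslash in the replacement, which R needs in order to pass `\1` to the regex engine.

```r
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA)

# Keep only the leading run of letters; "\\1" refers to the captured group
only_names <- sub("^([A-Za-z]*).*$", "\\1", factorA)

# Store as a factor, since the original column is a factor of names
df$factorB <- factor(only_names)
levels(df$factorB)

# A string without leading letters yields an empty string, not NA
no_letters <- sub("^([A-Za-z]*).*$", "\\1", "123abc")
```

Because a non-conforming string silently becomes "", it may be worth checking for empty values before relying on the new column.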

Using base R's sub and a regular expression:

> data.frame(factorA, factorB=sub("\\d+", "", factorA))
           factorA factorB
1    paul173643738    paul
2       paul827484    paul
3   george39585496  george
4 george7848658946  george
5      john2354674    john
6          john346    john
7      ringo384934   ringo
8       ringo24653   ringo
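One thing to keep in mind with this approach is that sub() replaces only the first run of digits, which is fine for the data shown but behaves differently if digits also appear inside a name. The value below is hypothetical and only illustrates the difference between sub(), gsub(), and an end-anchored pattern.

```r
# Hypothetical value with digits in two places
x <- "john2lennon3"

first_only <- sub("\\d+", "", x)    # removes only the first digit run
all_runs   <- gsub("\\d+", "", x)   # removes every digit run
trailing   <- sub("\\d+$", "", x)   # removes only digits at the end
```

Here `first_only` is "johnlennon3", `all_runs` is "johnlennon", and `trailing` is "john2lennon".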

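Try tidyr's extract with a regular expression that matches only letters.

library(tidyr)  # provides extract() and re-exports %>%

my.regex <- "([a-z]+)"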

df %>% 
  extract(factorA, 
          into = "factorB", 
          regex = my.regex,
          remove = FALSE)

#>            factorA factorB
#> 1    paul173643738    paul
#> 2       paul827484    paul
#> 3   george39585496  george
#> 4 george7848658946  george
#> 5      john2354674    john
#> 6          john346    john
#> 7      ringo384934   ringo
#> 8       ringo24653   ringo

Usually, though, I would aim for cleaner data, with the names and the values in separate columns:

my.regex <- "([a-z]+)([0-9]+)"

df %>% 
  extract(factorA, 
          into = c("factorA", "factorB"), 
          regex = my.regex,
          remove = FALSE)

#>   factorA    factorB
#> 1    paul  173643738
#> 2    paul     827484
#> 3  george   39585496
#> 4  george 7848658946
#> 5    john    2354674
#> 6    john        346
#> 7   ringo     384934
#> 8   ringo      24653

Created on 2018-07-28 by the reprex package (v0.2.0).
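For anyone who prefers to avoid the tidyr dependency, the same name/number split can be sketched in base R with regexec() and regmatches(). The column names `name` and `number` are my own choice, and the numeric part is converted with as.numeric() on the assumption that values such as 7848658946 are too large for R's integer type.

```r
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")

# Each list element holds: full match, letters group, digits group
parts <- regmatches(factorA, regexec("^([a-z]+)([0-9]+)$", factorA))

split_df <- data.frame(
  name   = vapply(parts, function(p) p[2], character(1)),
  number = as.numeric(vapply(parts, function(p) p[3], character(1))),
  stringsAsFactors = FALSE
)
```

Elements that do not match the pattern produce an empty match, so indexing them gives NA in both columns rather than an error.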