Create new column from factor column by multiple conditions
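I want to create a new column from an existing column that contains many factor levels, where the leading part of each level name repeats. Let me give an example: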
factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
df <- data.frame(factorA)
Here is my attempt:
library(dplyr)
df <- mutate(
df,factorB = case_when(
matches(factorA,"paul.") ~ "paul",
matches(factorA,"george.") ~ "george",
matches(factorA,"john.") ~ "john",
matches(factorA,"ringo.") ~ "ringo",
TRUE ~ "NA"))
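This gives me Error in mutate_impl(.data, dots) : Evaluation error: is_string(match) is not TRUE.
I think this is because I have not correctly specified how R should look for the string fragment I want.
The result should look like this: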
factorA factorB
1 paul173643738 paul
2 paul827484 paul
3 george39585496 george
4 george7848658946 george
5 john2354674 john
6 john346 john
7 ringo384934 ringo
8 ringo24653 ringo
I'm sure this has been asked before, but I couldn't find an answer that fits my needs. Any help would be appreciated.
Using stringr:
library(dplyr)
library(stringr)
df %>%
  mutate(factorB = case_when(
    str_detect(factorA, "paul.") ~ "paul",
    str_detect(factorA, "george.") ~ "george",
    str_detect(factorA, "john.") ~ "john",
    str_detect(factorA, "ringo.") ~ "ringo"
  ))
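For what it's worth, the error in the question most likely comes from `matches()` being a column-selection helper (meant for use inside `select()`) rather than a string predicate. Its first argument has to be a single pattern string, which is what `is_string(match) is not TRUE` complains about. Below is a minimal sketch of the same `case_when()` with `str_detect()` and an explicit fallback. The value `pete001` is made up to exercise the fallback branch, and the `^` anchors are my own assumption that the name always comes first.

```r
library(dplyr)
library(stringr)

# "pete001" is a hypothetical value, added only to show the fallback branch
factorA <- c("paul173643738", "george39585496", "john346", "ringo24653", "pete001")
df <- data.frame(factorA, stringsAsFactors = FALSE)

result <- df %>%
  mutate(factorB = case_when(
    str_detect(factorA, "^paul")   ~ "paul",
    str_detect(factorA, "^george") ~ "george",
    str_detect(factorA, "^john")   ~ "john",
    str_detect(factorA, "^ringo")  ~ "ringo",
    TRUE ~ NA_character_  # a real missing value, not the string "NA"
  ))

print(result)
```

Using `NA_character_` instead of `"NA"` means `is.na(result$factorB)` can find the unmatched rows later.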
You can use stringr::str_detect:
library(tidyverse)
factorA <- c("paul173643738","paul827484","george39585496","george7848658946","john2354674","john346","ringo384934","ringo24653")
# tibble() replaces the deprecated as_data_frame() and keeps the column name factorA
df <- tibble(factorA)
df %>%
  mutate(factorB = case_when(
    str_detect(factorA, "paul") ~ "paul",
    str_detect(factorA, "george") ~ "george",
    str_detect(factorA, "john") ~ "john",
    str_detect(factorA, "ringo") ~ "ringo"
  ))
#> # A tibble: 8 x 2
#> factorA factorB
#> <chr> <chr>
#> 1 paul173643738 paul
#> 2 paul827484 paul
#> 3 george39585496 george
#> 4 george7848658946 george
#> 5 john2354674 john
#> 6 john346 john
#> 7 ringo384934 ringo
#> 8 ringo24653 ringo
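If the list of names grows, writing one `str_detect()` line per name gets tedious. As a rough sketch of an alternative (not part of the answer above), the known names can be collapsed into a single alternation pattern and pulled out with `stringr::str_extract()`. The vector `known_names` is a name I made up for illustration.

```r
library(stringr)

factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# known_names is a hypothetical lookup vector; extend it as needed
known_names <- c("paul", "george", "john", "ringo")

# Builds "^(paul|george|john|ringo)": match any known name at the start
pattern <- paste0("^(", paste(known_names, collapse = "|"), ")")

# str_extract() returns the matched text, or NA where nothing matches
df$factorB <- str_extract(df$factorA, pattern)

print(df)
```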
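If the format of the strings in factorA is fixed, you can use gsub:
# Extract the names (the backreference must be written as "\\1" in an R string)
only_names <- gsub('(^[A-Za-z]*).*', '\\1', factorA)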
Using base R sub and a regular expression:
> data.frame(factorA, factorB=sub("\\d+", "", factorA))
factorA factorB
1 paul173643738 paul
2 paul827484 paul
3 george39585496 george
4 george7848658946 george
5 john2354674 john
6 john346 john
7 ringo384934 ringo
8 ringo24653 ringo
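One caveat: `sub()` replaces only the first match, so the pattern above works here because each value has a single run of digits at the end. A small sketch of the difference, using a made-up value `john2doe346` that is not in the question's data:

```r
# "john2doe346" is a hypothetical value with a digit inside the name
x <- "john2doe346"

# Removes the first run of digits, wherever it occurs
first_run <- sub("[0-9]+", "", x)

# Anchored with $, removes only the digits at the end of the string
trailing <- sub("[0-9]+$", "", x)

print(c(first_run = first_run, trailing = trailing))

# On the question's data both patterns give the same result
factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
factorB <- sub("[0-9]+$", "", factorA)
```

Anchoring with `$` is the safer choice if a name might ever contain a digit itself.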
Try extract with a regular expression that matches only letters.
library(tidyr)  # provides extract() and re-exports %>%
my.regex <- "([a-z]+)"
df %>%
  extract(factorA,
          into = "factorB",
          regex = my.regex,
          remove = FALSE)
#> factorA factorB
#> 1 paul173643738 paul
#> 2 paul827484 paul
#> 3 george39585496 george
#> 4 george7848658946 george
#> 5 john2354674 john
#> 6 john346 john
#> 7 ringo384934 ringo
#> 8 ringo24653 ringo
Generally, though, I would go for cleaner data, with discrete values and names:
my.regex <- "([a-z]+)([0-9]+)"
df %>%
extract(factorA,
into = c("factorA", "factorB"),
regex = my.regex,
remove = FALSE)
#> factorA factorB
#> 1 paul 173643738
#> 2 paul 827484
#> 3 george 39585496
#> 4 george 7848658946
#> 5 john 2354674
#> 6 john 346
#> 7 ringo 384934
#> 8 ringo 24653
Created on 2018-07-28 by the reprex package (v0.2.0).
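As a hedged follow-up to the two-column split: if the original `factorA` column should survive untouched, the two capture groups can go into new columns instead. In this sketch the column names `name` and `id` are my own choice, not from the question, and the numeric part is converted with `as.numeric()` because values such as 7848658946 do not fit in R's 32-bit integer type.

```r
library(tidyr)

factorA <- c("paul173643738", "paul827484", "george39585496", "george7848658946",
             "john2354674", "john346", "ringo384934", "ringo24653")
df <- data.frame(factorA, stringsAsFactors = FALSE)

# "name" and "id" are hypothetical column names chosen for illustration
result <- tidyr::extract(df, factorA,
                         into = c("name", "id"),
                         regex = "([a-z]+)([0-9]+)",
                         remove = FALSE)

# Store the ids as doubles; some exceed .Machine$integer.max
result$id <- as.numeric(result$id)

print(result)
```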