Use str_detect() to extract information from a column and create a new column
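I am working with a data.frame that contains a column whose values are named like this: D1_open, D9_shrub, D10_open, etc.
I would like to create a new column whose values are just "open" or "shrub". That is, I want to extract the words "open" and "shrub" from "ID_SubPlot" and put them in a new column. I believe str_detect() could be useful, but I can't figure out how to use it.
Example data: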
test <- structure(list(ID_Plant = c(243, 370, 789, 143, 559, 588, 746,
618, 910, 898), ID_SubPlot = c("D1_open", "D9_shrub", "D8_open",
"E4_shrub", "U5_shrub", "U10_open", "S10_shrub", "U10_shrub",
"S9_shrub", "S9_shrub")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Regex (see also the regex cheatsheet for R)
Just use ".*_(.*)"
This captures everything after the _ in the first group and replaces each string with that captured group.
test$col = gsub(".*_(.*)", "\\1", test$ID_SubPlot)
test
   ID_Plant ID_SubPlot   col
1       243    D1_open  open
2       370   D9_shrub shrub
3       789    D8_open  open
4       143   E4_shrub shrub
5       559   U5_shrub shrub
6       588   U10_open  open
7       746  S10_shrub shrub
8       618  U10_shrub shrub
9       910   S9_shrub shrub
10      898   S9_shrub shrub
Data
test=structure(list(ID_Plant = c(243, 370, 789, 143, 559, 588, 746, 618, 910, 898),
ID_SubPlot = c("D1_open", "D9_shrub", "D8_open", "E4_shrub", "U5_shrub", "U10_open", "S10_shrub", "U10_shrub", "S9_shrub", "S9_shrub")),
row.names = c(NA, -10L), class = c("data.frame"))
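To see what the pattern does outside the data frame, here is a minimal sketch on a plain character vector. The vector x and its two extra edge-case strings are made up for illustration and are not part of the example data. Note the doubled backslash in the replacement, which R string literals require, and that the greedy .* drops everything up to the last underscore.

```r
# Hypothetical inputs, including two edge cases not in the example data
x <- c("D1_open", "D9_shrub", "A_b_c", "nounderscore")

# Keep only what follows the last underscore
suffix <- sub(".*_(.*)", "\\1", x)
print(suffix)
# gives: open, shrub, c, nounderscore
```

Strings without an underscore don't match the pattern, so they come back unchanged rather than as NA.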
Here is an approach using separate from tidyr:
library(tidyr)
separate(test, ID_SubPlot, into = c("Code", "NewCol"), sep = "_")
Output
   ID_Plant Code NewCol
1       243   D1   open
2       370   D9  shrub
3       789   D8   open
4       143   E4  shrub
5       559   U5  shrub
6       588  U10   open
7       746  S10  shrub
8       618  U10  shrub
9       910   S9  shrub
10      898   S9  shrub
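If the original ID_SubPlot column should stay alongside the new one, separate() has a remove argument for that. A small sketch on a cut-down copy of the data (the names test_small and out are only for illustration):

```r
library(tidyr)

# Three rows taken from the example data
test_small <- data.frame(
  ID_Plant = c(243, 370, 588),
  ID_SubPlot = c("D1_open", "D9_shrub", "U10_open"),
  stringsAsFactors = FALSE
)

# remove = FALSE keeps ID_SubPlot in the result
out <- separate(test_small, ID_SubPlot, into = c("Code", "NewCol"),
                sep = "_", remove = FALSE)
out
```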
This may help too. I'm assuming you want to remove the ID part plus the underscore:
library(dplyr)
library(stringr)
test %>%
  mutate(result = str_remove(ID_SubPlot, "^[A-Za-z]\\d+(_)"))
# A tibble: 10 x 3
   ID_Plant ID_SubPlot result
      <dbl> <chr>      <chr>
 1      243 D1_open    open
 2      370 D9_shrub   shrub
 3      789 D8_open    open
 4      143 E4_shrub   shrub
 5      559 U5_shrub   shrub
 6      588 U10_open   open
 7      746 S10_shrub  shrub
 8      618 U10_shrub  shrub
 9      910 S9_shrub   shrub
10      898 S9_shrub   shrub
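Since the question asks about str_detect() specifically: it only returns TRUE or FALSE for each string, so on its own it can't extract anything, but it can drive a conditional. A rough sketch, assuming the column only ever ends in _open or _shrub (the column name habitat is my own choice, not from the question):

```r
library(dplyr)
library(stringr)

# Same values as the question's example data, built as a plain data.frame
test <- data.frame(
  ID_Plant = c(243, 370, 789, 143, 559, 588, 746, 618, 910, 898),
  ID_SubPlot = c("D1_open", "D9_shrub", "D8_open", "E4_shrub", "U5_shrub",
                 "U10_open", "S10_shrub", "U10_shrub", "S9_shrub", "S9_shrub"),
  stringsAsFactors = FALSE
)

# str_detect() tests each string; case_when() turns the matches into labels
result <- test %>%
  mutate(habitat = case_when(
    str_detect(ID_SubPlot, "_open$")  ~ "open",
    str_detect(ID_SubPlot, "_shrub$") ~ "shrub",
    TRUE                              ~ NA_character_
  ))
result
```

Anything that matches neither pattern ends up as NA, which makes unexpected values easy to spot. With more than a handful of categories, the extraction approaches above are likely less work than one str_detect() call per category.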