Accessing indexes of str_locate and str_locate_all
Suppose I have this data frame:
df <- structure(list(gender_age = c("males_rating_all_ages", "males_rating_<18",
"males_rating_18-29", "males_rating_30-44", "males_rating_45+",
"males_count_all_ages", "males_count_<18", "males_count_18-29",
"males_count_30-44", "males_count_45+", "females_rating_all_ages",
"females_rating_<18", "females_rating_18-29", "females_rating_30-44",
"females_rating_45+"), count = c("7.4", "8.0", "7.5", "7.2",
"7.5", "4,197", "15", "1,276", "1,631", "921", "7.8", "8.7",
"7.7", "7.8", "8.1")), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
I want to extract the gender, the age, and the type (i.e. count or rating) from the gender_age column and put each of them in its own column.
So far I have this code:
df %>% mutate(gender = str_sub(.$gender_age, 1, str_locate(.$gender_age, "_")[1,]-1)) %>%
mutate(age = str_sub(.$gender_age, str_locate_all(.$gender_age, "_")[[1]][2,], str_length(.$gender_age)))
# A tibble: 15 x 4
gender_age count gender age
<chr> <chr> <chr> <chr>
1 males_rating_all_ages 7.4 males _all_ages
2 males_rating_<18 8.0 males _<18
3 males_rating_18-29 7.5 males _18-29
4 males_rating_30-44 7.2 males _30-44
5 males_rating_45+ 7.5 males _45+
6 males_count_all_ages 4,197 males all_ages
7 males_count_<18 15 males <18
8 males_count_18-29 1,276 males 18-29
9 males_count_30-44 1,631 males 30-44
10 males_count_45+ 921 males 45+
11 females_rating_all_ages 7.8 femal ng_all_ages
12 females_rating_<18 8.7 femal ng_<18
13 females_rating_18-29 7.7 femal ng_18-29
14 females_rating_30-44 7.8 femal ng_30-44
15 females_rating_45+ 8.1 femal ng_45+
Warning messages:
1: Problem with `mutate()` column `gender`.
ℹ `gender = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length
2: Problem with `mutate()` column `age`.
ℹ `age = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length
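But as you can see, it applies the same fixed str_locate_all() indices to every row of the data. That is clearly not ideal, because the number of characters before the second underscore _ differs from row to row.
For example: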
> str_locate_all("males_rating_all_ages", "_")
[[1]]
start end
[1,] 6 6
[2,] 13 13
[3,] 17 17
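So I first have to index with [[1]] and then pick a specific row of the matrix (in my case [2,]) to get a single position that I can pass to the str_sub() expression.
But if I run: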
> str_locate_all("females_rating_all_ages", "_")
[[1]]
start end
[1,] 8 8
[2,] 15 15
[3,] 19 19
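We can see that when there are more characters before the underscores, the matrix reflects that. However, for the new columns I create inside mutate, it seems to have applied the indices of the first row to all subsequent rows.
Can anyone see what I am doing wrong? Or suggest another way to extract the three columns I want from gender_age (preferably using str_ functions)?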
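Instead of using str_locate, it may be easier to use extract, which captures groups based on a regex pattern
library(dplyr)
library(stringr)
library(tidyr)  # extract() comes from tidyr
df %>%
  # group 1 = text before the first "_", group 2 = everything after the second "_"
  extract(gender_age, into = c("gender", "age"),
      "^([^_]+)_[^_]+_(.*)", remove = FALSE)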
-output
# A tibble: 15 x 4
gender_age gender age count
<chr> <chr> <chr> <chr>
1 males_rating_all_ages males all_ages 7.4
2 males_rating_<18 males <18 8.0
3 males_rating_18-29 males 18-29 7.5
4 males_rating_30-44 males 30-44 7.2
5 males_rating_45+ males 45+ 7.5
6 males_count_all_ages males all_ages 4,197
7 males_count_<18 males <18 15
8 males_count_18-29 males 18-29 1,276
9 males_count_30-44 males 30-44 1,631
10 males_count_45+ males 45+ 921
11 females_rating_all_ages females all_ages 7.8
12 females_rating_<18 females <18 8.7
13 females_rating_18-29 females 18-29 7.7
14 females_rating_30-44 females 30-44 7.8
15 females_rating_45+ females 45+ 8.1
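Since the question asks for three columns and prefers str_ functions, here is a minimal sketch that also keeps the type. It assumes every value has at least two underscores; `df_small` is a hypothetical three-row sample, not the full data. `str_split_fixed()` with `n = 3` splits on the first two underscores only, so an age such as all_ages stays in one piece.

```r
library(dplyr)
library(stringr)

# Hypothetical sample in the same shape as the original df.
df_small <- tibble(
  gender_age = c("males_rating_all_ages", "males_count_<18", "females_rating_45+"),
  count = c("7.4", "15", "8.1")
)

# Split each string into at most 3 pieces on "_";
# the third piece keeps any remaining underscores.
parts <- str_split_fixed(df_small$gender_age, "_", n = 3)

res <- df_small %>%
  mutate(gender = parts[, 1],
         type   = parts[, 2],
         age    = parts[, 3])

res
```

The same three groups could also be captured with extract by wrapping the middle `[^_]+` in parentheses and adding a third name to `into`.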
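The issue in the OP's code is selecting the first list element of the str_locate_all output with [[. That works if the list has length 1, but here the list has as many elements as the data has rows, so [[1]] always picks the matrix for the first observation. This can be corrected by using rowwise before the mutate step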
df %>%
rowwise %>%
mutate(gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[1,1]-1)) %>%
mutate(age = str_sub(gender_age, str_locate_all(gender_age,
"_")[[1]][2,1]+1, str_length(gender_age)))
# A tibble: 15 x 4
# Rowwise:
gender_age count gender age
<chr> <chr> <chr> <chr>
1 males_rating_all_ages 7.4 males all_ages
2 males_rating_<18 8.0 males <18
3 males_rating_18-29 7.5 males 18-29
4 males_rating_30-44 7.2 males 30-44
5 males_rating_45+ 7.5 males 45+
6 males_count_all_ages 4,197 males all_ages
7 males_count_<18 15 males <18
8 males_count_18-29 1,276 males 18-29
9 males_count_30-44 1,631 males 30-44
10 males_count_45+ 921 males 45+
11 females_rating_all_ages 7.8 females all_ages
12 females_rating_<18 8.7 females <18
13 females_rating_18-29 7.7 females 18-29
14 females_rating_30-44 7.8 females 30-44
15 females_rating_45+ 8.1 females 45+
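To see why the un-grouped version reuses the first row's positions, it can help to inspect what the two functions return for a whole vector. This is a small sketch on two of the strings from the question; the variable names `x`, `loc` and `first_loc` are only for illustration.

```r
library(stringr)

x <- c("males_rating_all_ages", "females_rating_all_ages")

# str_locate_all() returns a list with one matrix per input string.
loc <- str_locate_all(x, "_")
length(loc)            # one element per string, not one overall

# [[1]] is the matrix for the FIRST string only, so these positions
# get recycled over every row when used inside an un-grouped mutate().
loc[[1]][2, ]          # start and end of the 2nd underscore in x[1]

# The second string has its own matrix with different positions.
loc[[2]][2, "start"]

# str_locate() returns a single matrix with one row per string,
# so the whole "start" column is the vectorised answer.
first_loc <- str_locate(x, "_")
first_loc[, "start"]
```

Note also that `[2, ]` returns both start and end (two values), which is what triggers the "longer object length is not a multiple of shorter object length" warning in the original code.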
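In addition, remove the .$ from the original code (it selects the whole column rather than the current row). Another option is to loop over the list with map and take the column of interest from each matrix in the output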
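A minimal sketch of the map option, assuming the purrr package is available and that every string contains at least two underscores. `df_small` and the helper column `second` are hypothetical names used only for this illustration.

```r
library(dplyr)
library(stringr)
library(purrr)

# Hypothetical two-row sample in the same shape as gender_age.
df_small <- tibble(gender_age = c("males_rating_all_ages", "females_count_<18"))

res <- df_small %>%
  mutate(
    # Position of the first "_" for every row: take the whole "start" column.
    gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[, "start"] - 1),
    # Position of the second "_" for every row: one matrix per string,
    # so pull row 2 of the "start" column from each matrix.
    second = map_int(str_locate_all(gender_age, "_"), ~ .x[2, "start"]),
    # Everything after the second "_".
    age = str_sub(gender_age, second + 1)
  ) %>%
  select(-second)

res
```

Unlike the rowwise version, this keeps the data frame un-grouped, so no later ungroup step is needed.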