Accessing indexes of str_locate and str_locate_all

Question

Suppose I have this data frame:

df <- structure(list(gender_age = c("males_rating_all_ages", "males_rating_<18", 
"males_rating_18-29", "males_rating_30-44", "males_rating_45+", 
"males_count_all_ages", "males_count_<18", "males_count_18-29", 
"males_count_30-44", "males_count_45+", "females_rating_all_ages", 
"females_rating_<18", "females_rating_18-29", "females_rating_30-44", 
"females_rating_45+"), count = c("7.4", "8.0", "7.5", "7.2", 
"7.5", "4,197", "15", "1,276", "1,631", "921", "7.8", "8.7", 
"7.7", "7.8", "8.1")), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"))

I want to extract the gender, the age, and the type (i.e. count or rating) from the gender_age column and put each of them in its own column.

So far I have this code:

df %>% mutate(gender = str_sub(.$gender_age, 1, str_locate(.$gender_age, "_")[1,]-1)) %>% 
  mutate(age    = str_sub(.$gender_age, str_locate_all(.$gender_age, "_")[[1]][2,], str_length(.$gender_age)))

# A tibble: 15 x 4
   gender_age              count gender age        
   <chr>                   <chr> <chr>  <chr>      
 1 males_rating_all_ages   7.4   males  _all_ages  
 2 males_rating_<18        8.0   males  _<18       
 3 males_rating_18-29      7.5   males  _18-29     
 4 males_rating_30-44      7.2   males  _30-44     
 5 males_rating_45+        7.5   males  _45+       
 6 males_count_all_ages    4,197 males  all_ages   
 7 males_count_<18         15    males  <18        
 8 males_count_18-29       1,276 males  18-29      
 9 males_count_30-44       1,631 males  30-44      
10 males_count_45+         921   males  45+        
11 females_rating_all_ages 7.8   femal  ng_all_ages
12 females_rating_<18      8.7   femal  ng_<18     
13 females_rating_18-29    7.7   femal  ng_18-29   
14 females_rating_30-44    7.8   femal  ng_30-44   
15 females_rating_45+      8.1   femal  ng_45+ 
    
Warning messages:
1: Problem with `mutate()` column `gender`.
ℹ `gender = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length 
2: Problem with `mutate()` column `age`.
ℹ `age = str_sub(...)`.
ℹ longer object length is not a multiple of shorter object length

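But as you can see, it applies the same fixed index from str_locate_all() to every row of the data. That is clearly not what I want, because the number of characters before the second underscore _ varies.

For example: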

> str_locate_all("males_rating_all_ages", "_")
[[1]]
     start end
[1,]     6   6
[2,]    13  13
[3,]    17  17

So I first have to index with [[1]] and then take a specific row of the matrix (in my case [2,]) to get a position I can pass to the str_sub() expression.

But if I run:

> str_locate_all("females_rating_all_ages", "_")
[[1]]
     start end
[1,]     8   8
[2,]    15  15
[3,]    19  19

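we can see that when there are more characters before the underscores, the matrix reflects that. However, for the new columns I create in the mutate call, it seems to reuse the indexes from the first row for all subsequent rows.

Can anyone see what I am doing wrong? Or suggest another way to extract the three columns I want from gender_age (preferably using str_ functions)?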

Answer 1

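Instead of using str_locate, it may be easier to use extract (from tidyr), which captures groups based on a regex pattern

library(dplyr)
library(tidyr)   # extract() comes from tidyr, not stringr
df %>%
    extract(gender_age, into = c("gender", "age"), 
         "^([^_]+)_[^_]+_(.*)", remove = FALSE)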

-output

# A tibble: 15 x 4
   gender_age              gender  age      count
   <chr>                   <chr>   <chr>    <chr>
 1 males_rating_all_ages   males   all_ages 7.4  
 2 males_rating_<18        males   <18      8.0  
 3 males_rating_18-29      males   18-29    7.5  
 4 males_rating_30-44      males   30-44    7.2  
 5 males_rating_45+        males   45+      7.5  
 6 males_count_all_ages    males   all_ages 4,197
 7 males_count_<18         males   <18      15   
 8 males_count_18-29       males   18-29    1,276
 9 males_count_30-44       males   30-44    1,631
10 males_count_45+         males   45+      921  
11 females_rating_all_ages females all_ages 7.8  
12 females_rating_<18      females <18      8.7  
13 females_rating_18-29    females 18-29    7.7  
14 females_rating_30-44    females 30-44    7.8  
15 females_rating_45+      females 45+      8.1
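
Since the question asks for three columns and prefers str_ functions, the same pattern can be given a third capture group and passed to stringr's str_match. The following is only a minimal sketch, assuming every value has the form gender_type_age; the small demo tibble and the column name type are illustrative and not part of the original post.

```r
library(dplyr)
library(stringr)

# Illustrative subset of the question's data (assumed for this sketch)
demo <- tibble(gender_age = c("males_rating_all_ages", "males_count_<18",
                              "females_rating_45+"))

# str_match() returns a character matrix: column 1 holds the full match,
# columns 2 to 4 hold the three capture groups
parts <- str_match(demo$gender_age, "^([^_]+)_([^_]+)_(.*)$")

result <- demo %>%
  mutate(gender = parts[, 2],
         type   = parts[, 3],
         age    = parts[, 4])
```

Because str_match is vectorised over the input strings, no rowwise step is needed here.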

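The issue in the OP's code is the use of [[ to select the first list element of the str_locate_all output. That works if the list has length 1, but here the list has the same length as the number of rows in the data, so [[1]] always selects the match matrix for the first observation. In addition, [1,] and [2,] each return two values (start and end), which are then recycled across the 15 rows; that is what triggers the warnings. This can be corrected by using rowwise before the mutate step and indexing a single cell of the matrix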

df %>%
   rowwise %>%
   mutate(gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[1,1]-1)) %>% 
    mutate(age    = str_sub(gender_age, str_locate_all(gender_age, 
          "_")[[1]][2,1]+1, str_length(gender_age)))
# A tibble: 15 x 4
# Rowwise: 
   gender_age              count gender  age     
   <chr>                   <chr> <chr>   <chr>   
 1 males_rating_all_ages   7.4   males   all_ages
 2 males_rating_<18        8.0   males   <18     
 3 males_rating_18-29      7.5   males   18-29   
 4 males_rating_30-44      7.2   males   30-44   
 5 males_rating_45+        7.5   males   45+     
 6 males_count_all_ages    4,197 males   all_ages
 7 males_count_<18         15    males   <18     
 8 males_count_18-29       1,276 males   18-29   
 9 males_count_30-44       1,631 males   30-44   
10 males_count_45+         921   males   45+     
11 females_rating_all_ages 7.8   females all_ages
12 females_rating_<18      8.7   females <18     
13 females_rating_18-29    7.7   females 18-29   
14 females_rating_30-44    7.8   females 30-44   
15 females_rating_45+      8.1   females 45+

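The .$ is also removed, since .$gender_age refers to the entire column rather than the current row. Another option is to use map to loop over the list and pull the column of interest from each matrix in the output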
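
The map idea could look roughly like the sketch below. It assumes purrr is available and that every string contains at least two underscores (otherwise the [2, "start"] lookup fails); the demo tibble and the helper column second are hypothetical names used only for illustration.

```r
library(dplyr)
library(stringr)
library(purrr)

# Illustrative data (assumed for this sketch)
demo <- tibble(gender_age = c("males_rating_all_ages", "females_count_<18"))

result_map <- demo %>%
  mutate(
    # str_locate() is vectorised and returns one matrix row per string,
    # so take the whole "start" column instead of a single row
    gender = str_sub(gender_age, 1, str_locate(gender_age, "_")[, "start"] - 1),
    # str_locate_all() returns a list with one matrix per string;
    # pull the start of the second underscore out of each matrix
    second = map_int(str_locate_all(gender_age, "_"), ~ .x[2, "start"]),
    age    = str_sub(gender_age, second + 1)
  ) %>%
  select(-second)
```

This keeps the computation vectorised, so neither rowwise nor the [[1]] indexing is needed.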
