How can I parse a variable into multiple columns according to multiple conditions in R?

Question

I'm new to R, so please bear with me. I'm looking at incarceration data and have a variable, conviction, which is a messy string that looks like this:

[1] "Ct. 1: Conspiracy to distribute"                                                                         
[2] "Aggravated Assault"                                                                                      
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"                                      
[4] "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"

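Ideally, I'd like to do two things. First, I'd like to parse each Ct. into its own column. For the first three rows, the data would look like this: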

     convictions                              conviction_1                      conviction_2                    
[1,] "Ct. 1: Conspiracy to distribute"        "Conspiracy to distribute"        NA                   
[2,] "Aggravated Assault"                     "Aggravated Assault"              NA                   
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"

But things get tricky when I reach the fourth row, because I want to parse the first part of the string (Ct. 1-6: Human Trafficking) into 6 columns, and then Cts. 7, 8 Unlawful contact into another 2 columns.

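Second, I'd like to generate a variable, convictions_total, that finds the largest number following Ct. in the conviction string. For the three example entries I've included here, convictions_total would look like: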

[1]  1  2 36

Here is the code I used to parse a more straightforward string variable, but I'm not sure how to adapt it to this one:

cols <- data.frame(str_split_fixed(data$convictions, ",", Inf))
colnames(cols) <- paste0("conviction_", seq_along(cols))
data <- cbind(data,cols)

Thanks in advance!

Answer 1

The following works for your example without much regex, mostly digit extraction or other string detection:

library(stringr)
library(magrittr)
library(purrr)
library(plyr)

# largest number in each string; max(..., 1) falls back to 1 when there are no digits
convictions_total <- sapply(stringr::str_extract_all(convictions, "\\d+"), 
                            function(x) max(as.numeric(x), 1))
convictions_split <- strsplit(convictions, ";")


reps <- lapply(convictions_split, FUN = function(x) {
    sapply(x, FUN = function(i) {
      # keep only the digits, dashes, and commas of this count
      num <- paste(stringr::str_extract_all(i, "[\\d\\-,]")[[1]], collapse = "")
      # "-" indicates a range: take largest value
      if (stringr::str_detect(num, "-")) {
        stringr::str_extract_all(num, "\\d+") %>% 
          unlist() %>% 
          as.numeric() %>%
          max()
      # "," indicates a sequence: get length of sequence
      } else if (stringr::str_detect(num, ",")) {
        stringr::str_count(num, ",") + 1
      # otherwise return 1
      } else {
        1
      }
    })
  })

# drop everything up to the last "digit, optional colon, whitespace" prefix
convictions_str <- lapply(convictions_split, 
                          function(x) gsub(".*\\d:?\\s(.*)$", "\\1", x))

df <- purrr::map2(convictions_str, reps, rep) %>% 
  plyr::ldply(rbind) %>% 
  cbind(convictions_total, .) %>% 
  data.frame() %>% 
  dplyr::rename_with(~ gsub("X", "conviction_", .x), dplyr::starts_with("X"))

Output

  convictions_total                    conviction_1        conviction_2      conviction_3
1                 1        Conspiracy to distribute                <NA>              <NA>
2                 1              Aggravated Assault                <NA>              <NA>
3                 2 Possession of prohibited object criminal forfeiture              <NA>
4                36               Human Trafficking   Human Trafficking Human Trafficking
       conviction_4      conviction_5      conviction_6     conviction_7     conviction_8
1              <NA>              <NA>              <NA>             <NA>             <NA>
2              <NA>              <NA>              <NA>             <NA>             <NA>
3              <NA>              <NA>              <NA>             <NA>             <NA>
4 Human Trafficking Human Trafficking Human Trafficking Unlawful contact Unlawful contact
           conviction_9 conviction_10
1                  <NA>          <NA>
2                  <NA>          <NA>
3                  <NA>          <NA>
4 Involuntary Servitude     Smuggling

Data

convictions <- c("Ct. 1: Conspiracy to distribute",
                 "Aggravated Assault",
                 "Ct. 1: Possession of prohibited object; Ct.: 2 criminal forfeiture",
                 "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling")

How it works

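convictions_total is straightforward: stringr::str_extract_all extracts all the numbers from each element of convictions and returns a list of vectors. sapply then takes the maximum of each vector in that list (or 1 if the vector is empty) and returns a vector.
reps is a list whose elements correspond to the elements of convictions. Each element stores a numeric vector giving how many times each conviction count should be repeated.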
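
The extraction step behind convictions_total can be checked in isolation. The following is a minimal sketch that swaps in base R (gregexpr and regmatches) for stringr so that it runs without any packages; the shortened input vector is made up for illustration.

```r
# Sketch in base R: largest number per string, defaulting to 1.
convictions <- c("Ct. 1: Conspiracy to distribute",
                 "Aggravated Assault",
                 "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 36: Smuggling")

# gregexpr() locates every run of digits; regmatches() returns them as a list
digits <- regmatches(convictions, gregexpr("[0-9]+", convictions))

# max(..., 1) falls back to 1 when a string contains no digits at all
totals <- vapply(digits, function(x) max(as.numeric(x), 1), numeric(1))
totals
```

The second string has no digits, so its vector of matches is empty and the fallback value 1 is used.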

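The code first splits convictions into a list of vectors and then, from each piece, extracts the following characters: digits (\d), dashes (\-), and commas (,). The logic works by searching these extracted strings: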

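First, if it finds a "-" in the count, that indicates a range, and it again takes the maximum. For example, "Ct. 1-6: Human Trafficking" returns 6.
Next, if it finds no "-" but does find a ",", that indicates a list of counts, so it counts the commas and adds one. For example, "Cts. 7, 8 Unlawful contact" returns 2.
Everything else is assumed to repeat only once, because it is neither a list nor a range.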
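
The three rules above can be sketched as a single function applied to one count at a time. This is an illustration only: count_reps is a hypothetical helper name, and it uses base R instead of the stringr calls in the answer.

```r
# Hypothetical helper: how many columns one count should occupy.
count_reps <- function(count) {
  # keep only digits, dashes, and commas, e.g. "Ct. 1-6: ..." becomes "1-6"
  num <- gsub("[^0-9,-]", "", count)
  if (grepl("-", num, fixed = TRUE)) {
    # range: take the largest number found
    max(as.numeric(regmatches(num, gregexpr("[0-9]+", num))[[1]]))
  } else if (grepl(",", num, fixed = TRUE)) {
    # comma-separated list: number of commas plus one
    lengths(regmatches(num, gregexpr(",", num, fixed = TRUE))) + 1
  } else {
    # single count, or no count prefix at all
    1
  }
}

count_reps("Ct. 1-6: Human Trafficking")
count_reps("Cts. 7, 8 Unlawful contact")
count_reps("Ct. 11: Involuntary Servitude")
```

Because the range rule takes the largest number, a range such as "Cts. 8-28" yields 28 rather than 21, so the rule matches the number of counts only when the range starts at 1.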

reps
[[1]]
Ct. 1: Conspiracy to distribute 
                              1 

[[2]]
Aggravated Assault 
                 1 

[[3]]
Ct. 1: Possession of prohibited object             Ct.: 2 criminal forfeiture 
                                     1                                      1 

[[4]]
    Ct. 1-6: Human Trafficking     Cts. 7, 8 Unlawful contact  Ct. 11: Involuntary Servitude 
                             6                              2                              1 
             Ct. 36: Smuggling 
                             1

convictions_str simply extracts the actual conviction text. For example, from "Ct. 1: Conspiracy to distribute" the code extracts "Conspiracy to distribute", and so on for every conviction.

[[1]]
[1] "Conspiracy to distribute"

[[2]]
[1] "Aggravated Assault"

[[3]]
[1] "Possession of prohibited object" "criminal forfeiture"            

[[4]]
[1] "Human Trafficking"     "Unlawful contact"      "Involuntary Servitude"
[4] "Smuggling"
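
To see what the gsub pattern does to individual pieces, here is a small sketch using the same pattern on a made-up vector. The last element is borrowed from the test data in Answer 2 to show a limitation.

```r
counts <- c("Ct. 1: Conspiracy to distribute",
            "Aggravated Assault",
            " Cts. 7, 8 Unlawful contact",
            "Ct. 36: Smuggling 50 grams")

# ".*" is greedy, so the match runs up to the LAST digit that is followed by
# an optional colon and a whitespace character; only the text after it is kept.
offence <- gsub(".*\\d:?\\s(.*)$", "\\1", counts)
offence
```

A string with no such prefix is returned unchanged. Note that the greedy match also consumes digits inside the offence text, so "Ct. 36: Smuggling 50 grams" is reduced to "grams"; the pattern assumes the offence description itself contains no number followed by a space.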

At this point reps and convictions_str have parallel structures:

convictions_str[[1]][1] should be repeated reps[[1]][1] times
convictions_str[[1]][2] should be repeated reps[[1]][2] times

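purrr::map2 takes advantage of this structure: it uses the rep function to repeat the elements of convictions_str by the values stored in reps and outputs a list. plyr::ldply row-binds this list, filling with NA because not everyone has the same number of convictions. cbind adds the convictions_total column, and dplyr::rename_with changes the column names.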
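
The same repeat-and-pad idea can be sketched in base R without purrr or plyr. This is an alternative illustration, not the answer's code: the two small input lists are made up, and Map, the length<- padding trick, and rbind stand in for map2 and ldply.

```r
convictions_str <- list("Aggravated Assault",
                        c("Human Trafficking", "Unlawful contact"))
reps <- list(1, c(3, 2))

# rep(x, times) with a vector of times repeats each element its own number of times
expanded <- Map(rep, convictions_str, reps)

# pad every vector with NA up to the longest one, then stack them as rows
width <- max(lengths(expanded))
padded <- lapply(expanded, function(x) { length(x) <- width; x })
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- paste0("conviction_", seq_len(width))
wide
```

Assigning a larger length to a vector fills the new positions with NA, which plays the role of the NA filling that plyr::ldply performs in the answer.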

Answer 2

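After two days of exploring, I came up with a condensed version of @LMc's code. It ended up working better, because calling plyr interfered with other code I had written: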

library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

test_data <- 
  tibble(id = 1:5, 
         convictions = c("Ct. 1: Conspiracy to distribute"    ,                                                                     
                         "Aggravated Assault"              ,                                                                        
                         "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"  ,                                    
                         "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams",
                         "Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money Laundering"))
test_data <- test_data %>% 
  mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it

test_data <- test_data %>%
  separate_rows(c2, sep = ";") %>%
  mutate(c2 = str_remove(c2, "Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
  group_by(id) %>%
  mutate(conviction_number = paste0("c_", row_number())) %>%
  pivot_wider(values_from = c2, names_from = conviction_number) 


test_data <- test_data %>% 
  mutate(c2 = convictions) #again, just preserving the original variable

test_data <- test_data %>%
  separate_rows(c2, sep = ";") %>% 
  mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2, "((?<=\\-)\\d+)")), str_extract(c2, "\\d+"), str_extract(c2, "((?<=\\-)\\d+)")))) %>% 
  mutate(total_counts = ifelse(is.na(total_counts), 1, total_counts)) %>% 
  group_by(id) %>% 
  slice_max(total_counts)

This yields the following data frame:

     id convictions                                                  c_1                c_2           c_3            c_4          c2                 total_counts
  <int> <chr>                                                        <chr>              <chr>         <chr>          <chr>        <chr>                     <dbl>
1     1 Ct. 1: Conspiracy to distribute                              Conspiracy to dis~  NA            NA             NA          "Ct. 1: Conspirac~            1
2     2 Aggravated Assault                                           Aggravated Assault  NA            NA             NA          "Aggravated Assau~            1
3     3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~  NA             NA          " Ct. 2: criminal~            2
4     4 Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct.~ Human Trafficking  " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~           36
5     5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money ~ Conspiracy         " Wire Fraud" " Money Laund~  NA          " Cts. 8-28:  Mon~           28

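The first block of code parses the counts into separate rows and then pivots them back into wide c_ columns. The second block does the same parsing, but then looks at each entry to parse out the number rather than the words.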

\d+ looks for any number, but it turns out I have data that looks like Cts. 2-7, where I want the value 7, not 2.

(?<=\-)\d+ looks for a hyphen and then extracts the number that follows it. If there is no hyphen, the code falls back to \d+.
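
The lookbehind-then-fallback logic can be sketched on a plain character vector. This version uses base R with perl = TRUE instead of stringr::str_extract; first_match is a hypothetical helper that mimics str_extract by returning NA where nothing matches.

```r
counts <- c("Ct. 1: Conspiracy", " Cts. 2-7: Wire Fraud", "Aggravated Assault")

# Hypothetical helper: first match per string, NA when there is none
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

after_dash <- first_match("(?<=-)\\d+", counts)  # number right after a hyphen
any_number <- first_match("\\d+", counts)        # first number of any kind

# prefer the number after the hyphen, else the first number, else 1
total_counts <- as.numeric(ifelse(is.na(after_dash), any_number, after_dash))
total_counts[is.na(total_counts)] <- 1
total_counts
```

Only the second string contains a hyphen, so it is the only one where the lookbehind result is used.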

Finally, slice_max collapses the data to one entry per ID, based on the highest value of total_counts.
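
One thing to watch: slice_max keeps ties by default, so an ID whose top total_counts value appears in two rows keeps both rows unless with_ties = FALSE is passed. The sketch below shows the same collapse in base R on made-up data, always keeping exactly one row per id; the column names follow the answer, the values are invented.

```r
parsed <- data.frame(id           = c(1, 2, 2, 3, 3),
                     total_counts = c(1, 1, 2, 7, 7))

# sort by id, then by total_counts descending, and keep the first row per id
ordered <- parsed[order(parsed$id, -parsed$total_counts), ]
one_per_id <- ordered[!duplicated(ordered$id), ]
one_per_id
```

Here id 3 has a tie at 7, and only one of the two tied rows survives.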
