Creating Multiple Covariates from Partial Matches of a Character Variable in R

Question

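I have a large data frame with an alphanumeric character variable. Specifically, it holds information on breed composition, from which I need to create covariates for breed fractions.

The breed composition column contains more than 7,000 breed combinations of varying length (i.e. some animals have 2 breeds, others have 10). A breed is always identified by a two-letter code, and its fraction is the coefficient that follows the code divided by the sum of all breed coefficients for that animal (coeftot).

I am looking for a way to take the coefficients from this variable (breed) and build covariates holding the proportions of 7 specific breeds (SU, DP, RV, RI, CD, PO, HA). The data contain more breed codes than these, and some animals may have none of the breeds of interest. The data frame has over 1 million records, and I have not found an efficient solution that avoids countless grepl / if else statements for every specific breed code and every coefficient of interest (e.g. SU1 through SUx). The problem is further complicated because the coefficients do not sum to the same number for every animal. Below is an example of my data frame and the desired output. Any ideas are appreciated!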

   id <- c(1:8)
   breed <- c("SU1","DP1RI1","DP1RI1RV1SU1","DP3XX1","SU9RV7","XX1","DP7XX1","SU32RV16DP8RI8")
   sheep <- data.frame(id,breed)

   id    breed           coeftot     SU     DP     RV     RI     CD     PO     HA
   1     SU1             1           1      0      0      0      0      0      0
   2     DP1RI1          2           0      0.5    0      0.5    0      0      0
   3     DP1RI1RV1SU1    4           0.25   0.25   0.25   0.25   0      0      0
   4     DP3XX1          4           0      0.75   0      0      0      0      0
   5     SU9RV7          16          0.5625 0      0.4375 0      0      0      0
   6     XX1             1           0      0      0      0      0      0      0
   7     DP7XX1          8           0      0.875  0      0      0      0      0
   8     SU32RV16DP8RI8  64          0.5    0.125  0.25   0.125  0      0      0

Answer 1

If you need memory and speed efficiency, the data.table package is a good fit, and stringi helps a lot with the string manipulation.

library(stringi)

breed_codes <- unique(unlist(stri_extract_all_regex(
  sheep[["breed"]], "[A-Z]+"
)))
breed_codes
# "SU" "DP" "RI" "RV" "XX"
patterns <- sprintf("(?<=%s)\\d+", breed_codes)
patterns
# "(?<=SU)\\d+" "(?<=DP)\\d+" "(?<=RI)\\d+" "(?<=RV)\\d+" "(?<=XX)\\d+"

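First we use a regular expression to extract every breed code that occurs in the animals' breed strings. These are runs of consecutive uppercase letters ([A-Z]+). Next we build one regular expression per code to capture its coefficient.

We want to capture the run of digits (\d+) that directly follows each breed code, which is what the lookbehind (e.g. (?<=SU)) enforces. We then loop over the breeds and give each one a column holding the digits captured by its pattern. If an animal's breed string does not contain the code, we set the value to 0.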
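To make the lookbehind concrete, here is a minimal sketch on two breed strings from the question. The names x, pat_rv and rv_digits are introduced here only for illustration.

```r
library(stringi)

# Two example breed strings taken from the question's data
x <- c("SU32RV16DP8RI8", "DP3XX1")

# Build the lookbehind pattern for a single breed code, as sprintf() does for all codes
pat_rv <- sprintf("(?<=%s)\\d+", "RV")

# Digits directly after "RV"; NA where the code does not occur
rv_digits <- stri_extract_first_regex(x, pat_rv)
rv_digits
# "16" NA
```

Because every breed code is exactly two letters, the lookbehind cannot accidentally match the tail of a longer code.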

library(data.table)

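# Convert to a data.table by reference (no copy)
setDT(sheep)
# Add one integer column per breed code, holding that breed's coefficient
set(
  sheep,
  j = breed_codes,
  value = lapply(
    patterns,
    function(pat) {
      # First run of digits after this breed code; NA when the code is absent
      digits <- stri_extract_first_regex(sheep[["breed"]], pat)
      digits[is.na(digits)] <- "0"
      as.integer(digits)
    }
  )
)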

Finally, we sum the coefficients in each row to get the total.

sheep[, coeftot := rowSums(.SD), .SDcols = breed_codes]
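One way to sanity-check coeftot is to sum every run of digits in the raw breed string, independently of the per-breed columns. This is only a sketch that assumes the breed strings from the question; coeftot_check is a name introduced here.

```r
library(stringi)

# Breed strings from the question
breed <- c("SU1", "DP1RI1", "DP1RI1RV1SU1", "DP3XX1",
           "SU9RV7", "XX1", "DP7XX1", "SU32RV16DP8RI8")

# All digit runs per string, returned as a list of character vectors
all_digits <- stri_extract_all_regex(breed, "\\d+")

# Sum the coefficients of each animal
coeftot_check <- vapply(all_digits, function(d) sum(as.integer(d)), integer(1))
coeftot_check
# 1 2 4 4 16 1 8 64
```

If these totals differ from the coeftot column, a breed code was probably missed by the extraction step.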

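If you want to merge all breeds other than the seven specific ones into a single "other" column, we just identify them, sum them row-wise, and then drop them from the data set.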

special_breeds <- c("SU", "DP", "RV", "RI", "CD", "PO", "HA")
non_special_breeds <- setdiff(breed_codes, special_breeds)

sheep[, other := rowSums(.SD), .SDcols = non_special_breeds]
set(sheep, j = non_special_breeds, value = NULL)
sheep
#    id          breed SU DP RI RV coeftot other
# 1:  1            SU1  1  0  0  0       1     0
# 2:  2         DP1RI1  0  1  1  0       2     0
# 3:  3   DP1RI1RV1SU1  1  1  1  1       4     0
# 4:  4         DP3XX1  0  3  0  0       4     1
# 5:  5         SU9RV7  9  0  0  7      16     0
# 6:  6            XX1  0  0  0  0       1     1
# 7:  7         DP7XX1  0  7  0  0       8     1
# 8:  8 SU32RV16DP8RI8 32  8  8 16      64     0
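The table above still holds raw coefficients, while the question asks for fractions and for a column for each of the seven breeds, including those absent from the sample (CD, PO, HA). A minimal sketch of that last step follows. It rebuilds the table from the printed output so it runs on its own, and missing_breeds is a name introduced here. It assumes every animal has at least one coefficient, so coeftot is never 0.

```r
library(data.table)

# Rebuild the coefficient table printed above so this sketch is self-contained
sheep <- data.table(
  id = 1:8,
  breed = c("SU1", "DP1RI1", "DP1RI1RV1SU1", "DP3XX1",
            "SU9RV7", "XX1", "DP7XX1", "SU32RV16DP8RI8"),
  SU = c(1L, 0L, 1L, 0L, 9L, 0L, 0L, 32L),
  DP = c(0L, 1L, 1L, 3L, 0L, 0L, 7L, 8L),
  RI = c(0L, 1L, 1L, 0L, 0L, 0L, 0L, 8L),
  RV = c(0L, 0L, 1L, 0L, 7L, 0L, 0L, 16L),
  coeftot = c(1, 2, 4, 4, 16, 1, 8, 64),
  other = c(0, 0, 0, 1, 0, 1, 1, 0)
)

special_breeds <- c("SU", "DP", "RV", "RI", "CD", "PO", "HA")

# Breeds of interest that never occur in the data get an all-zero column
missing_breeds <- setdiff(special_breeds, names(sheep))
if (length(missing_breeds) > 0L) {
  sheep[, (missing_breeds) := 0L]
}

# Divide each coefficient by the row total to turn it into a breed fraction
sheep[, (special_breeds) := lapply(.SD, function(x) x / coeftot),
      .SDcols = special_breeds]
sheep
```

With this, animal 5 (SU9RV7) gets SU = 9/16 = 0.5625 and RV = 7/16 = 0.4375, matching the desired output in the question.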

