在字符串中搜索单词,return 个变量,其中单词在 R/tidyverse 中的字符串中的位置

Search string for word, return variables with word position(s) within string in R/tidyverse

我有一个包含 39 个独特的字母数字五码的列表,例如我想检查的数据帧变量 (pattern$single_codes) 中的“KFC10”、“TKC10”、“DG012” 如果它们包含在不同数据帧中的另一个变量 (database$multi_codes) 中。

(multi_codes) 是一个字符变量,可以包含 0 到 30 个字母数字,不一定是唯一的(可能会出现重复)由空格分隔的代码,它会例如看起来像“TKC10 JFB30 TKC10 DG001 DG012 DG002 TKC10 UGC12 DG012 TKC10”,在此示例中,它在字符串中包含 10 个空格分隔的代码。

我想要的最终结果是将 39 个额外变量添加到以相应的“single_codes[=57”命名的“数据库”数据框中=]" 值 例如(数据库$KFC10, 数据库$TKC10, 数据库$DG012 ...).

如果相应的“single_codes”值,则 39 个额外变量中的每一个都会有一个 NA 在 数据库 $multi_codes 的相应行中找不到(注意:搜索并返回字符串中单词的位置,而不是字符)。另一方面,如果找到“single_codes”值,它应该显示 空格分隔的“database$multi_codes”字符串中的相应位置。

例如,如果我只搜索前面提到的那些代码,在提到的单个字符串中,我会在数据库中得到 3 个额外的变量:

  1. database$KFC10 变量的值为:NA.
  2. database$TKC10 变量的值为:1,3,7,10.
  3. database$DG012 变量的值为:5,9.

我该怎么做?

我在下面做了一个不完整的尝试(不是 working/useful),但我不确定如何完成它。很高兴收到与我完全不同的方法。

模式:

pattern <- structure(list(single_codes = c("JDW97", "JDW98", "JCA05", "JCA38", 
"JCA42", "JCA45", "AF021", "JCA96", "JCA98", "JCC00", "JCC10", 
"TJA10", "JCC20", "TJK01", "JCC96", "DV093", "JCD10", "JCW98", 
"JDA05", "JDA38", "QM007", "JDA52", "JDA63", "JDC00", "JDC10", 
"DT022", "JDC20", "JDC30", "JDC40", "JDC96", "JDC97", "AF063", 
"JDD01", "JDD96", "JDW96", "AF037", "UJD02", "UJD05", "PN000"
)), row.names = c(NA, -39L), class = "data.frame")

数据库:

database <- structure(list(id = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6, 6255, 
6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 
6255, 6255, 6255, 6255, 6255, 6255, 6255, 6255, 7290, 7290, 7290, 
7290, 7290, 7290, 11832, 11832, 13991, 13991, 13991, 13991, 13991, 
13991, 13991, 13991, 13991, 13991), multi_codes = c("", "AF063", 
"UJD05", "JCF12 JFF00 UJD02", "", "", "TPX10", "", "UJD02", "AV034 DT016 JDC00 DV065 QB008", 
"UGC12 UJC02 UEN12 UEN05 XXA00", "JCC10 JCC10 DR036 DR036 DR029 DR029", 
"8340 8440", "JCA45", "", "", "AF070 AF012", "FNG05 AF037", "AF021 AF063", 
"AF063 AF012", "AF037", "AF037", "AF021 AF070", "ABC56", "UJD05", 
"", "XV015", "", "AF021 AF064 CKD05 DR016 DR029", "XS007", "", 
"AF063 AP029 AP051 DG017 DG021 DG023 DG024 DR029 DR055 DV065 GBB00 SP299 SP311 TJD10 TKC20 XV018 ZXH10 ZXH60", 
"DR029 JDA63 JDH35 JFB43 JWA00 UJD02", "DV051 DV093 DV094 PM003 PN000 PN001 PT000 QB003 QG001 QG001 QG001 QM007 QM007 QM007 QM007 QM007 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QN000 QN000 QT007 QV012 XS005 XS005", 
"UJD02", "DV093 DV094 PN000 PN000 DV093 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG003 QG003 QG003 QG003 QG003 QG003 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007", 
"JCA55 JDA55", "JCA55 JDH35 UJD02", "DT022 DT022 DT022 DT022 DT022 DT022 DT022 DV051 DV053 DV055 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057 DV057", 
"DT022 DT022 DU007 DV057 DV057 DV057 DV057 DV057 DV057 DV058 DV058 DV058 DV076 DV076 DV076 DV076 DV076 DV076 GB003", 
"8841", "", "", "AF063", "JDC30", "AF063", "TJA10 DV093 DV094 PA000 PM000 PN000 PN000 PT000 QB001 QB001 QB001 QB003 QG003 QM000 QM000 QM000 QM015 QM017 QM017 QM017 QT007 XS011 XS011 XS910 XS910 XS910 XS910 XS910 XS910 XS910", 
"JKA21 TJK01", "XV015", "DR036 TJA10 TJA10 XV018")), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 12L, 13L, 17L, 36L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 
55L, 56L, 57L, 58L, 65L, 66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 
74L, 75L, 76L, 77L, 78L, 82L, 83L, 84L, 85L), class = "data.frame")

到目前为止我的(非工作)尝试:

for (i in 1:length(pattern$single_codes)) {
  print(paste(which(str_split(database$multi_codes, boundary("word"))[[1]] == pattern$single_codes[i]), collapse=","))
}

这是一种处理方法。

func <- function(a, b) paste(which(a %in% b), collapse = ",")
out <- outer(strsplit(database$multi_codes, " "),
             setNames(nm = pattern$single_codes),
             function(a, b) mapply(func, a, b))

dim(out)
# [1] 50 39

out[32:37,20:30]
#      JDA38 QM007                              JDA52 JDA63 JDC00 JDC10 DT022 JDC20 JDC30 JDC40 JDC96
# [1,] ""    ""                                 ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [2,] ""    ""                                 ""    "2"   ""    ""    ""    ""    ""    ""    ""   
# [3,] ""    "12,13,14,15,16"                   ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [4,] ""    ""                                 ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [5,] ""    "20,21,22,23,24,25,26,27,28,29,30" ""    ""    ""    ""    ""    ""    ""    ""    ""   
# [6,] ""    ""                                 ""    ""    ""    ""    ""    ""    ""    ""    ""   

每一行都是来自 database 的一行,每一列都是来自 pattern 的一个字符串。这可以 cbind 相对直接地使用:

database2 <- cbind(database, out)
database2[32:37,]
#      id                                                                                                                                                                         multi_codes JDW97 JDW98 JCA05 JCA38 JCA42 JCA45 AF021 JCA96 JCA98 JCC00
# 58 6255                                                                         AF063 AP029 AP051 DG017 DG021 DG023 DG024 DR029 DR055 DV065 GBB00 SP299 SP311 TJD10 TKC20 XV018 ZXH10 ZXH60                                                            
# 65 7290                                                                                                                                                 DR029 JDA63 JDH35 JFB43 JWA00 UJD02                                                            
# 66 7290 DV051 DV093 DV094 PM003 PN000 PN001 PT000 QB003 QG001 QG001 QG001 QM007 QM007 QM007 QM007 QM007 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QN000 QN000 QT007 QV012 XS005 XS005                                                            
# 67 7290                                                                                                                                                                               UJD02                                                            
# 68 7290 DV093 DV094 PN000 PN000 DV093 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG003 QG003 QG003 QG003 QG003 QG003 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007                                                            
# 69 7290                                                                                                                                                                         JCA55 JDA55                                                            
#    JCC10 TJA10 JCC20 TJK01 JCC96 DV093 JCD10 JCW98 JDA05 JDA38                            QM007 JDA52 JDA63 JDC00 JDC10 DT022 JDC20 JDC30 JDC40 JDC96 JDC97 AF063 JDD01 JDD96 JDW96 AF037 UJD02 UJD05 PN000
# 58                                                                                                                                                              1                                          
# 65                                                                                                        2                                                                                   6            
# 66                                   2                                           12,13,14,15,16                                                                                                           5
# 67                                                                                                                                                                                            1            
# 68                                 1,5                         20,21,22,23,24,25,26,27,28,29,30                                                                                                         3,4
# 69                                                                                                                                                                                                         

dplyr中,可以用bind_cols(database, out).

NA 代替空字符串可能会有用。为此,更改为

func <- function(a, b) { o <- which(a %in% b); if (length(o)) paste(o, collapse = ",") else NA; }
out <- outer(strsplit(database$multi_codes, " "), setNames(nm = pattern$single_codes), function(a, b) mapply(func, a, b))
database2 <- cbind(database, out)

database2[32:37,]
#      id                                                                                                                                                                         multi_codes JDW97 JDW98 JCA05 JCA38 JCA42 JCA45 AF021 JCA96 JCA98 JCC00
# 58 6255                                                                         AF063 AP029 AP051 DG017 DG021 DG023 DG024 DR029 DR055 DV065 GBB00 SP299 SP311 TJD10 TKC20 XV018 ZXH10 ZXH60  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 65 7290                                                                                                                                                 DR029 JDA63 JDH35 JFB43 JWA00 UJD02  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 66 7290 DV051 DV093 DV094 PM003 PN000 PN001 PT000 QB003 QG001 QG001 QG001 QM007 QM007 QM007 QM007 QM007 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QM015 QN000 QN000 QT007 QV012 XS005 XS005  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 67 7290                                                                                                                                                                               UJD02  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 68 7290 DV093 DV094 PN000 PN000 DV093 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG001 QG003 QG003 QG003 QG003 QG003 QG003 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007 QM007  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 69 7290                                                                                                                                                                         JCA55 JDA55  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#    JCC10 TJA10 JCC20 TJK01 JCC96 DV093 JCD10 JCW98 JDA05 JDA38                            QM007 JDA52 JDA63 JDC00 JDC10 DT022 JDC20 JDC30 JDC40 JDC96 JDC97 AF063 JDD01 JDD96 JDW96 AF037 UJD02 UJD05 PN000
# 58  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# 65  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     6  <NA>  <NA>
# 66  <NA>  <NA>  <NA>  <NA>  <NA>     2  <NA>  <NA>  <NA>  <NA>                   12,13,14,15,16  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     5
# 67  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1  <NA>  <NA>
# 68  <NA>  <NA>  <NA>  <NA>  <NA>   1,5  <NA>  <NA>  <NA>  <NA> 20,21,22,23,24,25,26,27,28,29,30  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   3,4
# 69  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>                             <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>