Partial intersection of elements across vectors in two lists

I have a list like this:

mylist <- list(PP = c("PP 1", "OMITTED"),
           IN01 = c("DID NOT PARTICIPATE", "PARTICIPATED", "OMITTED"),                     
           RD1 = c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED"),
           RD2 = c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED"),
           LOS = c("LESS THAN 3", "3 TO 100", "100 TO 500", "MORE THAN 500", "LOGICALLY NOT APPLICABLE", "OMITTED"),
           COM = c("BAN", "SBAN", "RAL"), 
           VR1 = c("WITHIN 30", "WITHIN 200", "NOT AVAILABLE", "OMITTED"),                         
           INF = c("A LOT", "SOME", "LITTLE OR NO", "NOT APPLICABLE", "OMITTED"),               
           IST = c("FULL-TIME", "PART-TIME", "FULL STAFFED", "NOT STAFFED", "LOGICALLY NOT APPLICABLE", "OMITTED"),
           CMP = c("ALL", "MOST", "SOME", "NONE", "LOGICALLY NOT APPLICABLE", "OMITTED"))

I also have another list like this:

matchlist <- list("INVALID", c("INVALID", "OMITTED OR INVALID"),
c("INVALID", "OMITTED"), "OMITTED", c("NOT REACHED", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "INVALID", "OMITTED OR INVALID"),
c("Not applicable", "Not stated"), c("Not reached", "Not administered/missing by design", "Presented but not answered/invalid"),
c("Not administered/missing by design", "Presented but not answered/invalid"),
"OMITTED OR INVALID",
c("LOGICALLY NOT APPLICABLE", "OMITTED OR INVALID"),
c("NOT REACHED", "OMITTED"),
c("NOT APPLICABLE", "OMITTED"), 
c("LOGICALLY NOT APPLICABLE", "OMITTED"),
c("LOGICALLY NOT APPLICABLE", "NOT REACHED", "OMITTED"),
"NOT EXCLUDED", c("Default", "Not applicable", "Not stated"), c("Valid Skip", "Not Reached", "Not Applicable", "Invalid", "No Response"),
c("Not administered", "Omitted"),
c("NOT REACHED", "INVALID RESPONSE", "OMITTED"),
c("INVALID RESPONSE", "OMITTED"))

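As you can see, some of the vectors in matchlist partially match vectors in mylist. In some cases, a vector in matchlist exactly matches part of a vector in mylist. For example, the last values of mylist$RD1 match the vector in the fifth component of matchlist, but RD2 does not match it, although they share common values. The values in mylist$RD2 ("NOT REACHED", "NOT AN OPTION", "OMITTED"), taken together and in this order, have no match in any of the vectors in matchlist.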

The same goes for the values of COM.

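What I would like to achieve is to compare the elements of each vector in mylist against each vector in matchlist, extract the common values where they match in the same order, and store them in another list. The desired result should look like this: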

$PP
[1] "OMITTED"

$IN01
[1] "OMITTED"

$RD1
[1] "NOT REACHED" "INVALID" "OMITTED"

$RD2
character(0)

$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"

$COM
character(0)

$VR1
[1] "OMITTED"

$INF
[1] "NOT APPLICABLE" "OMITTED"

$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"

$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"

What I have tried so far:

Using intersect:

lapply(mylist, function(i) {
  intersect(i, lapply(matchlist, function(i) {i}))
})

It returns just the last value of each vector ("OMITTED").

Using match / %in%:

lapply(mylist, function(i) {
  i[which(i %in% matchlist)]
})

This returns only part of the desired result for RD1 ("INVALID", "OMITTED"); for the rest it returns just the last value ("OMITTED"), except for COM, which is correct.

Using mapply with intersect:

mapply(intersect, mylist, matchlist)

This returns a long list containing almost everything, including combinations that should not be there, plus a warning about unequal lengths.

Can someone help?

Here is a simple solution that uses unlist on matchlist:

lapply(mylist, function(x) x[x %in% unlist(matchlist)])

Output (the new list):

$PP
[1] "OMITTED"

$IN01
[1] "OMITTED"

$RD1
[1] "NOT REACHED" "INVALID"     "OMITTED"    

$LOS
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"                 

$COM
character(0)

$VR1
[1] "OMITTED"

$INF
[1] "NOT APPLICABLE" "OMITTED"       

$IST
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"                 

$CMP
[1] "LOGICALLY NOT APPLICABLE" "OMITTED"                 

Simply writing

lapply(mylist, intersect, unlist(matchlist))

also works.
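To illustrate why flattening makes a difference, here is a minimal sketch on a small stand-in for the question's data (ml, rd1 and rd2 below are hypothetical subsets, not the full lists). It relies on the fact that %in% coerces each list component to a single string, so only length-one components can match when the table is a list. It also shows a caveat: flattening ignores grouping and adjacency, so a vector like RD2 still returns its shared values.

```r
# Stand-in data: a hypothetical subset of the question's lists
ml  <- list("INVALID", "OMITTED",
            c("NOT REACHED", "INVALID", "OMITTED"),
            c("NOT REACHED", "OMITTED"))
rd1 <- c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED")
rd2 <- c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED")

# Matching against the list itself: each component is coerced to one
# string, so only the length-one components ("INVALID", "OMITTED") can match.
in_list <- rd1[rd1 %in% ml]

# Matching against the flattened values: every individual string can match.
in_flat <- rd1[rd1 %in% unlist(ml)]

# Caveat: grouping and adjacency are ignored, so RD2 keeps its shared values
# even though "NOT AN OPTION" sits between them.
rd2_flat <- rd2[rd2 %in% unlist(ml)]

print(in_list)
print(in_flat)
print(rd2_flat)
```

If RD2 has to come back empty, as in the desired output, the grouping inside matchlist needs to be kept rather than flattened.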

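lapply(mylist, function(i) {
  # keep each value of i that exactly equals some value anywhere in matchlist
  unlist(sapply(i,function(x){if(any(grepl(paste0("^",x,"$"),unlist(matchlist)))){x}}))
})

I anchored the pattern with "^" and "$" because otherwise "NO" would also find "NOT". As the other answers show, using grepl is certainly not the best approach :)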

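There are indeed some simple, good answers, but they all seem to rely on unlist. I assume you need to preserve the grouping within matchlist, so unlisting it does not make sense. Here is a solution without it, using a double lapply loop as you started to do: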

out <- lapply(mylist, function(this) {
  mtch <- lapply(matchlist, intersect, this)
  wh <- which.max(lengths(mtch))
  if (length(wh)) mtch[[wh]] else character(0)
})
str(out)
# List of 9
#  $ PP  : chr "OMITTED"
#  $ IN01: chr "OMITTED"
#  $ RD1 : chr [1:3] "NOT REACHED" "INVALID" "OMITTED"
#  $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ COM : chr(0) 
#  $ VR1 : chr "OMITTED"
#  $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
#  $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"

It always returns the vector with the most matches. If more than one vector ties for the most matches, it returns the first of them in matchlist order, because which.max returns the index of the first maximum.
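The tie-breaking behaviour can be checked with a tiny sketch (the counts vector is made up for illustration and stands in for lengths(mtch)):

```r
# Hypothetical match counts with a tie for the maximum at positions 2 and 4
counts <- c(1L, 3L, 0L, 3L)

first_max <- which.max(counts)           # index of the first maximum only
all_max   <- which(counts == max(counts)) # every index that attains the maximum

print(first_max)
print(all_max)
```

If all tied candidates are wanted rather than only the first, which(counts == max(counts)) is one way to get them.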

Update

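A constraint was added: the values of a matchlist vector must not only be present and in order, there must also be no intervening words. For example, if "BLAH" were inserted into mylist$RD1 as suggested in the comments, it would no longer match matchlist[[5]].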

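Checking whether one vector is a perfectly ordered subset of another is a bit more involved (so this is no code-golf champion) and generally scales poorly, since there is no simple way to determine such a subset. With that caveat, this implementation uses some nested *apply functions ...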

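(Note: someone suggested in the comments that $RD1 should return character(0), but it does contain "INVALID", which matches a single-length component of matchlist, so it should match that one rather than the longer one.)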

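out <- lapply(mylist, function(this) {
  # for each matchlist vector, the positions in `this` where its first element occurs
  ind <- lapply(matchlist, function(a) which(this == a[1]))
  # score each matchlist vector: its length if it occurs as an unbroken run, otherwise 0
  perfectmatches <- mapply(function(ml, allis, this) {
    length(ml) * any(vapply(allis, function(i) {
      # isTRUE() treats NA (a run that would extend past the end of `this`) as no match
      isTRUE(all(ml == this[ i + seq_along(ml) - 1 ]))
    }, logical(1)))
  }, matchlist, ind, MoreArgs = list(this=this))
  if (any(perfectmatches > 0)) {
    wh <- which.max(perfectmatches)  # longest run; ties go to the first one
    return(matchlist[[wh]])
  } else return(character(0))
})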
str(out)
# List of 9
#  $ PP  : chr "OMITTED"
#  $ IN01: chr "OMITTED"
#  $ RD1 : chr "INVALID"
#  $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ COM : chr(0) 
#  $ VR1 : chr "OMITTED"
#  $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
#  $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
#  $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
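
The "present, in order, and with nothing in between" test can also be looked at in isolation. The following is only a sketch, not part of the answer above: contains_run is a hypothetical helper name, and the vectors are copied from the question's RD1 and RD2. It slides a window of the needle's length over the haystack and checks each window for an exact match.

```r
# Hypothetical helper: TRUE if `needle` occurs in `haystack` as an
# uninterrupted, in-order run of elements.
contains_run <- function(needle, haystack) {
  n <- length(needle)
  if (n == 0L || n > length(haystack)) return(FALSE)
  # every start position at which a window of length n still fits
  starts <- seq_len(length(haystack) - n + 1L)
  any(vapply(starts, function(s) {
    identical(haystack[s + seq_len(n) - 1L], needle)
  }, logical(1)))
}

rd1 <- c("YES", "NO", "NOT REACHED", "INVALID", "OMITTED")
rd2 <- c("YES", "NO", "NOT REACHED", "NOT AN OPTION", "OMITTED")
target <- c("NOT REACHED", "INVALID", "OMITTED")

hit_rd1  <- contains_run(target, rd1)                          # run is present in RD1
hit_rd2  <- contains_run(target, rd2)                          # "INVALID" is missing in RD2
hit_gap  <- contains_run(c("NOT REACHED", "OMITTED"), rd2)     # interrupted by "NOT AN OPTION"
hit_last <- contains_run("OMITTED", rd2)                       # single values still match

print(c(hit_rd1, hit_rd2, hit_gap, hit_last))
```

Because the window never extends past the end of the haystack, no NA comparisons arise. The last check also shows why RD2 still matches a single-length component such as "OMITTED" under this rule.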