Match strings with partial matching allowed but only when there's a unique match

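I have a vector names and another vector v that I need to match against names. I want to get the indices in names at which the elements of v match. Partial matching should be allowed, but only if the partial match is unique.

The following example covers all relevant cases: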

names <- c("a", "b", "c", "ab", "def", "defg", "hij")

v1 <- c("a", "b")
v2 <- c("a", "ab")
v3 <- c("d")
v4 <- c("h")
v5 <- c("a", "b", "a")

I would like to get the following output:

match_names(v1, names)
# c(1, 2)
match_names(v2, names)
# c(1, 4)
match_names(v3, names)
# error
match_names(v4, names)
# 7
match_names(v5, names)
# c(1, 2, 1)

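How can I write such a function? I have considered (combinations of) which and grep, but so far I have not found anything useful.


What I tried

(Before I knew about the partial-matching requirement...)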

match_names1 <- function(v, names) {
  sapply(v, function(i) which(i == names))
}

This works for examples v1, v2, and v5.


After learning about the partial-matching requirement

match_names2 <- function(v, names) {
  sapply(v, function(i) grep(i, names))
}

...which of course only works for v4, because grep also returns every name that merely contains the pattern.
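
To see why grep alone over-matches, here is a quick console check using the names vector from the example above (results shown as comments):

```r
names <- c("a", "b", "c", "ab", "def", "defg", "hij")

grep("a", names)  # 1 4: "a" is also a substring of "ab"
grep("b", names)  # 2 4: same problem for "b"
grep("d", names)  # 5 6: ambiguous, both "def" and "defg" match
grep("h", names)  # 7:   the only pattern with a single hit
```

So an exact match has to be tried first, and a partial match should only be accepted when grep returns exactly one index.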


To catch v3, I use the following extension of match_names1:

match_names3 <- function(v, names) {
  exact <- match_names1(v, names)
  assertthat::assert_that(class(exact) != "list")
  return(exact)
}

So this covers v1, v2, v3, and v5, but not v4.


Thanks in advance for any hints.

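I chose to return a list because your question does not say what should happen when a string from v matches exactly more than once in names, and returning a list is the only way to keep all exact matches. If you don't like lists, you can simply unlist() the result.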

match_names <- function(v, names){
    # check exact matches:
    resList <- lapply(v, function(elt) which(names == elt))

    notMatched <- which(lengths(resList) == 0) 
    if (length(notMatched) == 0) return (resList)

    #partial matching
    else{
        resNotMatched <- lapply(v[notMatched], grep, x = names)
        matchedOnce <- which(lengths(resNotMatched) == 1) 
    }

    resList[notMatched[matchedOnce]] <- resNotMatched[matchedOnce]
    return (resList)
}
> match_names(v1, names)
[[1]]
[1] 1

[[2]]
[1] 2

> # c(1, 2)
> match_names(v2, names)
[[1]]
[1] 1

[[2]]
[1] 4

> # c(1, 4)
> match_names(v3, names)
[[1]]
integer(0)

> # error
> match_names(v4, names)
[[1]]
[1] 7

> # 7
> match_names(v5, names)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 1
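
One caveat with unlist(): elements that found no match are integer(0) and are silently dropped, so the result no longer lines up with v. Below is a minimal sketch of a stricter flattening step, assuming (as in the question) that anything other than exactly one index per element should raise an error. flatten_matches is a hypothetical helper name, not part of the answer above.

```r
# Hypothetical helper: turn a list of index vectors into a plain integer
# vector, stopping if any element of v did not get exactly one index.
flatten_matches <- function(res_list, v) {
  bad <- lengths(res_list) != 1
  if (any(bad)) {
    stop("no unique match for: ", paste(v[bad], collapse = ", "))
  }
  unlist(res_list)
}

flatten_matches(list(1L, 4L), c("a", "ab"))  # 1 4
unlist(list(1L, integer(0), 7L))             # 1 7: the empty element vanishes
# flatten_matches(list(integer(0)), "d") stops with an error
```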

[[ can be used to partially match one name at a time:

f = function(v){
  sapply(v, function(x) setNames(seq_along(names), names)[[x, exact=FALSE]])
}

# try it on the example
vs = list(v1,v2,v3,v4,v5)
for (i in seq_along(vs)){
  cat("\nv", i, ":\n", sep="")
  print(try( f(vs[[i]]) ))
}

which produces

v1:
a b 
1 2 

v2:
 a ab 
 1  4 

v3:
Error in setNames(seq_along(names), names)[[x, exact = FALSE]] : 
  subscript out of bounds
[1] "Error in setNames(seq_along(names), names)[[x, exact = FALSE]] : \n  subscript out of bounds\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in setNames(seq_along(names), names)[[x, exact = FALSE]]: subscript out of bounds>

v4:
h 
7 

v5:
a b a 
1 2 1 
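
A vectorized alternative is base R's pmatch(), which tries exact matches first and then unique partial matches, returning NA otherwise. Below is a minimal sketch, assuming an unmatched or ambiguous element should raise an error as in the question. match_names_pmatch is a made-up name for illustration. Like [[ with exact = FALSE, this matches prefixes only, whereas grep matches anywhere in the string.

```r
names <- c("a", "b", "c", "ab", "def", "defg", "hij")

# Hypothetical wrapper around base::pmatch()
match_names_pmatch <- function(v, names) {
  # duplicates.ok = TRUE lets the same entry of names be matched repeatedly
  idx <- pmatch(v, names, duplicates.ok = TRUE)
  if (anyNA(idx)) {
    stop("no unique match for: ", paste(v[is.na(idx)], collapse = ", "))
  }
  idx
}

match_names_pmatch(c("a", "ab"), names)      # 1 4
match_names_pmatch(c("a", "b", "a"), names)  # 1 2 1
match_names_pmatch("h", names)               # 7
# match_names_pmatch("d", names) stops: "def" and "defg" both start with "d"
```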

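The match function should work for all cases except the partial match in v4. To handle partial matching as well, you can define a function like the following: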

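match_names <- function(v, names) {

  # Exact matches first; NA where there is none
  ind <- match(v, names)

  # For each element without an exact match, try partial matching.
  # This is done one element at a time because grepl() only uses the
  # first element of a pattern vector.
  for (i in which(is.na(ind))) {

    # grepl to find partial matching index
    hits <- which(grepl(v[i], names))

    # Accept the partial match only if it is unique; otherwise keep NA
    if (length(hits) == 1) ind[i] <- hits

  }

  return(ind)

}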

> match_names(v1, names)
[1] 1 2
> match_names(v2, names)
[1] 1 4
> match_names(v3, names)
[1] NA
> match_names(v4, names)
[1] 7
> match_names(v5, names)
[1] 1 2 1