Partial string matching between data frames with match() or similar to preserve match positions

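I want to use match() to perform partial string matching between two character vectors that live in different data frames. The positions of the matched values must be preserved, because they are used later to look up adjacent columns, and match() seemed the best fit for that.

Exact string matching works fine: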

## exact string matching
name <-  c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none') 
meaning2 <- c('surface','longitudinal','transverse','not detected') 
meaning3 <- c('category 1','category 1','category 1','category 2') 
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
  name      meaning1     meaning2   meaning3
1  AAB      circular      surface category 1
2  AAC      parallel longitudinal category 1
3  AAD perpendicular   transverse category 1
4  AAE          none not detected category 2
> myData 
  name2
1   AAB
2   AAC
3   AAD
4   AAE

matched <- match(myData[ , 'name2'],  referenceData[ ,'name'])
> matched
[1] 1 2 3 4

myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
  name2        newCol      newCol2
1   AAB      circular      surface
2   AAC      parallel longitudinal
3   AAD perpendicular   transverse
4   AAE          none not detected

The real data is messier, though, and only matches partially, so the approach above no longer works:

name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData 
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday

matched <- match(myData[ , 'name2'],  referenceData[ ,'name'])
> matched
[1] NA NA NA NA

myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
                    name2 newCol newCol2
1 AAB Monday and Thursday   <NA>    <NA>
2            AAC Saturday   <NA>    <NA>
3           AAD Wednesday   <NA>    <NA>
4              AAE Friday   <NA>    <NA>

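Can match() be combined with regular expressions to do partial matching?

Edit: the reproducible example was oversimplified. Something more representative would be: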

name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday

You can use sapply and grep like this:

sapply(referenceData[, 'name'], grep, myData[, 'name2'])

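Note that I reversed the order of the arguments: "AAB" used as a regular expression matches "AAB Monday and Thursday", but not the other way around. The result is indexed by reference name, and each element holds the rows of myData that contain that name.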
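Because the grep result is grouped by reference name rather than by row of myData, it needs to be inverted before it can stand in for the output of match(). The following is a minimal sketch of one way to do that, not the only one. The names `hits` and the loop are illustrative, `fixed = TRUE` is an assumption that the reference names should be treated as literal text rather than regular expressions, and the first matching reference row wins, mirroring what match() does.

```r
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# For each reference name, the rows of myData whose name2 contains it.
# lapply() always returns a list, even when every name matches exactly one row.
hits <- lapply(referenceData$name, grep, x = myData$name2, fixed = TRUE)

# Invert: for each row of myData, the index of the reference row it matched.
matched <- rep(NA_integer_, nrow(myData))
for (i in seq_along(hits)) {
  # Only fill rows that have no match yet, so the first reference row wins.
  rows <- hits[[i]][is.na(matched[hits[[i]]])]
  matched[rows] <- i
}

# matched can now be used exactly like the result of match().
myData$newCol <- referenceData$meaning1[matched]
```

Rows of myData that contain none of the reference names keep NA in `matched`, so they produce NA in the new column, the same as with an exact match() that finds nothing.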

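Edit: given your edit, if you know you will always be matching on just the first three characters, you can try this simpler approach (no partial matching needed):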

first3 <- substr(myData[ , 'name2'],  1, 3)
match(first3,  referenceData[ ,'name'])
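If the code is not always exactly three characters long, a variation on the same idea is to match on everything before the first space instead. This is a sketch under the assumption that the code is always the first space-delimited token in name2, which holds for the example data but may not hold for the real data. The name `key` is illustrative.

```r
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Drop everything from the first space onward, leaving only the leading code.
key <- sub(" .*", "", myData$name2)

# Exact matching on the extracted code preserves positions, as before.
matched <- match(key, referenceData$name)

myData$newCol  <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
```

Either way, the partial match is reduced to an exact one first, so match() keeps doing the position lookup and repeated values such as the three "AAB" rows all resolve to the same reference row.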