Partial string matching between data frames with match() or similar to preserve match positions
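I want to use match() to perform partial string matching between two character vectors that live in different data frames.
The positions of the matched values must be preserved, because they are used later to look up adjacent columns, and match() seemed the best fit for that.
Exact string matching works fine: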
## exact string matching
name <- c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none')
meaning2 <- c('surface','longitudinal','transverse','not detected')
meaning3 <- c('category 1','category 1','category 1','category 2')
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
name meaning1 meaning2 meaning3
1 AAB circular surface category 1
2 AAC parallel longitudinal category 1
3 AAD perpendicular transverse category 1
4 AAE none not detected category 2
> myData
name2
1 AAB
2 AAC
3 AAD
4 AAE
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] 1 2 3 4
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
name2 newCol newCol2
1 AAB circular surface
2 AAC parallel longitudinal
3 AAD perpendicular transverse
4 AAE none not detected
However, the real data is messier and can only be matched partially, so the approach above fails:
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
name2
1 AAB Monday and Thursday
2 AAC Saturday
3 AAD Wednesday
4 AAE Friday
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] NA NA NA NA
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
name2 newCol newCol2
1 AAB Monday and Thursday <NA> <NA>
2 AAC Saturday <NA> <NA>
3 AAD Wednesday <NA> <NA>
4 AAE Friday <NA> <NA>
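Can match() be combined with regular expressions to do partial matching?
Edit
The reproducible example was oversimplified. Something more representative would be: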
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
name2
1 AAB Monday and Thursday
2 AAC Saturday
3 AAD Wednesday
4 AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday
You can use sapply and grep like this:
sapply(referenceData[, 'name'], grep, myData[, 'name2'])
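Note that I reversed the order of the arguments: "AAB" used as a regular expression matches "AAB Monday and Thursday", but not the other way around. For each reference name, the result holds the row positions in myData that contain it.
Edit: Given your edit, if you know you will always be matching only the first three characters, you can try this simpler approach (no partial matching needed):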
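The sapply/grep call returns positions in myData grouped by reference name, which is the opposite orientation from match(). If a match()-style vector is needed (one reference row index per row of myData), a minimal sketch could look like the following. The helper name partial_match is hypothetical, fixed = TRUE assumes the reference names should be treated as literal substrings rather than regular expressions, and taking the first hit when several names match is also an assumption.

```r
# Data from the question (edited version with repeated AAB rows)
referenceData <- data.frame(
  name = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each string in x, return the index of the first
# element of patterns found inside it, or NA if none is found.
partial_match <- function(x, patterns) {
  vapply(x, function(s) {
    # Assumption: patterns are literal substrings, hence fixed = TRUE
    hits <- which(vapply(patterns, function(p) grepl(p, s, fixed = TRUE),
                         logical(1)))
    if (length(hits) == 0L) NA_integer_ else hits[[1L]]
  }, integer(1), USE.NAMES = FALSE)
}

matched <- partial_match(myData$name2, referenceData$name)
myData$newCol <- referenceData$meaning1[matched]
```

Because matched has the same length and order as myData, it can be used to index any column of referenceData exactly as in the exact-match example.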
first3 <- substr(myData[ , 'name2'], 1, 3)
match(first3, referenceData[ ,'name'])
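If the code is not always exactly three characters, a variation on the same idea is to cut each string at the first whitespace and match on that token. This is only a sketch, and it assumes the code is always the first space-delimited word in name2, which the question does not state explicitly.

```r
# Data from the question (edited version with repeated AAB rows)
referenceData <- data.frame(
  name = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Assumption: the code is everything before the first whitespace character
code <- sub("\\s.*$", "", myData$name2)

# Exact matching on the extracted code keeps one position per row of myData
matched <- match(code, referenceData$name)
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
```

Rows whose extracted code is absent from referenceData get NA in matched, and therefore NA in the new columns, just as with plain match().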