Fuzzy match and replace strings in a data frame using a custom dictionary
I have this data frame containing near-duplicate values (strings with small spelling differences):
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")
places2clean <- data.frame(place1, place2, place3)
Here is my custom dictionary:
dictionnary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
dictionnary <- data.frame(dictionnary)
I want to match every string against the custom dictionary and replace it with the matching entry.
Expected result:
place1      place2      place3
Pondichéry  Lorient     Lorient
Pondichéry  Pondichéry  Pondichéry
Pondichéry  Lorient     Brest
Port-Louis  Port-Louis  Port Louis
Port-Louis  Port-Louis  Nantes
How can I use string distance to match and replace values across the whole data frame?
The base R function adist and stringdist::amatch are both useful here. There is no reason to make your dictionary a data.frame, so I have not done so here.
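If you want to experiment, the stringdist package offers several distance methods, although the default works fine here. Note that both functions pick the best match, but return NA when no dictionary entry is close enough (as defined by the maxDist argument).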
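library(stringdist)

# Using the stringdist package: amatch() returns the index of the closest
# dictionary entry, or NA when none is within maxDist
clean_places <- function(places, dictionary, maxDist = 5) {
  dictionary[amatch(places, dictionary, maxDist = maxDist)]
}

# Using base R: adist() gives a places-by-dictionary distance matrix
clean_places2 <- function(places, dictionary, maxDist = 5) {
  sm <- adist(places, dictionary)
  sm[sm > maxDist] <- NA
  # which.min() returns integer(0) for an all-NA row, so return NA explicitly
  idx <- apply(sm, 1, function(d) if (all(is.na(d))) NA_integer_ else which.min(d))
  dictionary[idx]
}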
dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
place1 <- c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis ")
place2 <- c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")
clean_places(place1, dictionary)
# [1] "Pondichéry" "Pondichéry" "Pondichéry" "Port-Louis" "Port-Louis"
clean_places(place2, dictionary)
# [1] "Lorient" "Pondichéry" "Lorient" "Port-Louis" "Port-Louis"
clean_places(place3, dictionary)
# [1] "Lorient" "Pondichéry" "Brest" "Port-Louis" "Nantes"
clean_places2(place1, dictionary)
# [1] "Pondichéry" "Pondichéry" "Pondichéry" "Port-Louis" "Port-Louis"
clean_places2(place2, dictionary)
# [1] "Lorient" "Pondichéry" "Lorient" "Port-Louis" "Port-Louis"
clean_places2(place3, dictionary)
# [1] "Lorient" "Pondichéry" "Brest" "Port-Louis" "Nantes"
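The calls above clean one vector at a time. To cover the whole data frame, as the question asks, one option is to apply the cleaner to every column with lapply. The following is a minimal sketch in base R only; the helper name closest_entry, its max_dist argument and the extra value "Paris" are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): nearest dictionary entry
# by generalized Levenshtein distance, or NA when nothing is within max_dist.
closest_entry <- function(places, dictionary, max_dist = 5) {
  d <- adist(places, dictionary)   # rows: places, columns: dictionary entries
  d[d > max_dist] <- NA
  idx <- apply(d, 1, function(row) {
    if (all(is.na(row))) NA_integer_ else which.min(row)
  })
  dictionary[idx]
}

dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
places2clean <- data.frame(
  place1 = c("pondichery ", "Pondichery", "Pondichéry", "Port-Louis", "Port Louis "),
  place2 = c("Lorent", "Pondichery", " Lorient", "port-louis", "Port Louis"),
  place3 = c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes"),
  stringsAsFactors = FALSE
)

# Assigning into cleaned[] keeps the data frame shape and column names.
cleaned <- places2clean
cleaned[] <- lapply(places2clean, closest_entry, dictionary = dictionary)
print(cleaned)

# With a strict cutoff, a value far from every entry comes back as NA.
strict <- closest_entry(c("Brest", "Paris"), dictionary, max_dist = 1)
print(strict)
```

Choosing the cutoff is a judgment call: a large value forces almost every string onto some dictionary entry, while a small one leaves more NA values to review by hand.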
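The following first computes, for each column, the matrix of distances to the dictionary, and then picks the dictionary string with the smallest distance.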
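library(stringdist)

# Strip leading and trailing whitespace from every column
places2clean[] <- lapply(places2clean, trimws)

# One distance matrix per column: rows are values, columns are dictionary entries
d <- lapply(places2clean, function(x) {
  sapply(dictionnary$dictionnary, function(y) stringdist(x, y))
})

# For each row, keep the dictionary entry with the smallest distance
res <- sapply(d, function(x) {
  inx <- apply(x, 1, which.min)
  dictionnary$dictionnary[inx]
})
as.data.frame(res)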
# place1 place2 place3
#1 Pondichéry Lorient Lorient
#2 Pondichéry Pondichéry Pondichéry
#3 Pondichéry Lorient Brest
#4 Port-Louis Port-Louis Port-Louis
#5 Port-Louis Port-Louis Nantes
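To see why a given string maps to a given entry, it can help to look at the distance matrix for a single column. The sketch below does this for place3 and is an assumption-laden illustration, not the answer's code: it uses base R's adist instead of stringdist so that it runs without extra packages, and the names dist_mat, best_idx and report are made up for the example. adist counts a transposition as two edits, so its distances can differ from those of stringdist's default method.

```r
dictionary <- c("Pondichéry", "Lorient", "Port-Louis", "Nantes", "Brest")
place3 <- c("Loirent", "Pondchéry", "Brest", "Port Louis", "Nantes")

# Rows are the input strings, columns are the dictionary entries.
dist_mat <- adist(trimws(place3), dictionary)
dimnames(dist_mat) <- list(place3, dictionary)
print(dist_mat)

# Column index of the smallest distance in each row.
best_idx <- unname(apply(dist_mat, 1, which.min))

# Report each input next to its chosen entry and the distance that won.
report <- data.frame(
  input    = place3,
  match    = dictionary[best_idx],
  distance = dist_mat[cbind(seq_along(place3), best_idx)],
  stringsAsFactors = FALSE
)
print(report)
```

Because which.min always returns some index, this approach assigns an entry even to strings that resemble nothing in the dictionary; inspecting the distance column is one way to spot such cases.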