Partial string matching & replacement in R
I have a data frame like this:
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Invezzstment LLC
4 Investment_LLC
5 Haiperloop LLC
6 Inwestment LLC
I need to match all of these fuzzy strings, so the end result should look like this:
> myDataFrame
company
1 Investment LLC
2 Hyperloop LLC
3 Investment LLC
4 Investment LLC
5 Hyperloop LLC
6 Investment LLC
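So, in effect, I have to solve a partial match-and-replace task on a categorical variable. There are plenty of great functions in base R and in packages for string matching, but I'm stuck on finding a single solution that handles both the matching and the replacement.
I don't care which spelling replaces the others; "Investment LLC" or "Invezzstment LLC" are equally good. I just need them to be consistent.
Is there a single all-in-one function or loop for this?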
If you have a vector of the correct spellings, agrep makes this fairly easy:
# For each value, return the correct spelling(s) it approximately matches
myDataFrame$company <- sapply(myDataFrame$company,
                              function(val) {
                                agrep(val,
                                      c('Investment LLC', 'Hyperloop LLC'),
                                      value = TRUE)
                              })
myDataFrame
# company
# 1 Investment LLC
# 2 Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5 Hyperloop LLC
# 6 Investment LLC
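One caveat with the sapply/agrep call above: agrep can return zero or several matches for a value, in which case sapply returns a list rather than a character vector. A minimal sketch of an alternative, assuming the same vector of correct spellings, is to compute the edit distance to every correct spelling with adist and keep the nearest one. The names `messy`, `canonical`, and `cleaned_nearest` are made up for illustration.

```r
# Messy values from the question and the assumed vector of correct spellings
messy <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
           "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")
canonical <- c("Investment LLC", "Hyperloop LLC")

# adist() returns a matrix of generalized Levenshtein distances:
# one row per messy value, one column per correct spelling
d <- adist(messy, canonical)

# For each row, pick the column with the smallest distance
cleaned_nearest <- canonical[apply(d, 1, which.min)]
cleaned_nearest
```

This always yields exactly one replacement per row, so it will also "correct" values that are not close to any spelling; checking the minimum distance against a threshold first may be worthwhile.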
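If you don't have such a vector, you can likely create one with a clever application of adist, or even just table if the correct spelling is repeated more often than the others, which it likely will be (though it isn't here).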
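One way the adist idea could be sketched, without a vector of correct spellings: cluster the values on their pairwise edit distances and replace each value with one representative of its cluster. This is only an illustration, not necessarily what the answer had in mind; the cut height `h = 4` is an assumed threshold that happens to suit this sample, and `companies`, `groups`, and `cleaned_clustered` are made-up names.

```r
companies <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# Pairwise generalized Levenshtein distances between all values
d <- adist(companies)

# Hierarchical clustering (complete linkage by default), cut at an
# assumed height: values within about 4 edits end up in the same group
groups <- cutree(hclust(as.dist(d)), h = 4)

# Representative of a group: its most frequent spelling,
# with ties going to the spelling that appears first
pick_representative <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

cleaned_clustered <- ave(companies, groups, FUN = pick_representative)
cleaned_clustered
```

The threshold needs tuning per data set, and the resulting groups should still be inspected by eye before overwriting the column.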
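So, after a while I ended up with this dumb code. Note: it does not fully automate the replacement process, because every proposed match should be verified by a human, and each time the agrep max.distance argument needs fine-tuning. I'm quite sure there are ways to make it better and faster, but it gets the job done.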
##########
# Manual renaming with partial matches
##########
# a) Take a look at the desired column of factor variables
sort(unique(MYDATA$names)) # take a look
# ****
Sensthreshold <- 0.2 # sensitivity of agrep, usually 0.1-0.2 get it right
Searchstring <- "Invesstment LLC" # what should I search?
# ****
# User-defined function: returns similar string on query in column
Searcher <- function(input, similarity = 0.1) {
  unique(agrep(input,
               MYDATA$names, # <-- define your column here
               ignore.case = TRUE, value = TRUE,
               max.distance = similarity))
}
# b) Make a search of desired string
Searcher(Searchstring, Sensthreshold) # using user-def function
### PLEASE INSPECT THE OUTPUT OF THE SEARCH
### Did it get it right?
#=============================================================================#
## ACTION! This changes your dataframe!
## Please make backup before proceeding
## Please execute this code as a whole to avoid errors
# c) Make a vector of cells indexes after checking output
vector_of_cells <- agrep(Searchstring,
                         MYDATA$names, ignore.case = TRUE,
                         max.distance = Sensthreshold)
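# d) Apply the changes
# If the column is a factor, assigning a string that is not an existing
# level would produce NA, so work on a character column
MYDATA$names <- as.character(MYDATA$names)
MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING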
# e) Check result
unique(agrep(Searchstring, MYDATA$names,
             ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
#=============================================================================#
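The replace step above could also be wrapped in a small function that returns a modified copy instead of changing MYDATA in place, which makes it easier to compare before and after. This is a sketch only; `replace_fuzzy` and its arguments are hypothetical names, and the 0.2 default simply mirrors the Sensthreshold used above.

```r
# Replace every element of x that approximately matches `target`
# with `target` itself; returns a character vector, x is not modified
replace_fuzzy <- function(x, target, max_distance = 0.2) {
  x <- as.character(x)  # avoid NA when x is a factor
  hits <- agrep(target, x, ignore.case = TRUE, max.distance = max_distance)
  x[hits] <- target
  x
}

companies_raw <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
                   "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# Only the Investment variants are touched; inspect the result before
# assigning it back to the data frame column
companies_fixed <- replace_fuzzy(companies_raw, "Investment LLC")
companies_fixed
```

A second call with "Hyperloop LLC" as the target would handle the remaining variant. The manual check of each match is still needed, since a looser max_distance can pull in unrelated names.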