Partial string matching & replacement in R

I have a data frame like this:

> myDataFrame
           company
1   Investment LLC
2    Hyperloop LLC
3 Invezzstment LLC
4   Investment_LLC
5   Haiperloop LLC
6   Inwestment LLC

I need to match all of these fuzzy strings, so the final result should look like this:

> myDataFrame
           company
1   Investment LLC
2    Hyperloop LLC
3   Investment LLC
4   Investment LLC
5    Hyperloop LLC
6   Investment LLC

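So, in effect, I have to solve a partial match-and-replace task on a categorical variable. There are plenty of great functions in base R and in packages for string matching, but I'm stuck on finding a single solution for this kind of matching and replacement. I don't care which occurrence replaces the others; "Investment LLC" and "Invezzstment LLC" are equally fine. I just need them to be consistent.

Is there a single all-in-one function or loop for this?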

If you have a vector of the correct spellings, agrep makes this reasonably easy:

myDataFrame$company <- sapply(myDataFrame$company, 
                              function(val){agrep(val, 
                                                  c('Investment LLC', 'Hyperloop LLC'), 
                                                  value = TRUE)})

myDataFrame
#          company
# 1 Investment LLC
# 2  Hyperloop LLC
# 3 Investment LLC
# 4 Investment LLC
# 5  Hyperloop LLC
# 6 Investment LLC
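
One caveat with the sapply/agrep call above: if any value matches none or more than one of the candidates, sapply returns a list instead of a character vector. A minimal sketch of an alternative, assuming you are happy to always take the single nearest candidate, uses adist and which.min. The name `canonical` is made up for this illustration, and note that this version assigns the nearest spelling even when it is far away, so it is worth checking the distances on real data.

```r
# Example data from the question
myDataFrame <- data.frame(
  company = c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
              "Investment_LLC", "Haiperloop LLC", "Inwestment LLC"),
  stringsAsFactors = FALSE
)

# Assumed vector of correct spellings (hypothetical name)
canonical <- c("Investment LLC", "Hyperloop LLC")

# Edit-distance matrix: one row per observed value, one column per candidate
d <- adist(myDataFrame$company, canonical)

# For each row, pick the candidate with the smallest edit distance
myDataFrame$company <- canonical[apply(d, 1, which.min)]

myDataFrame$company
# "Investment LLC" "Hyperloop LLC" "Investment LLC"
# "Investment LLC" "Hyperloop LLC" "Investment LLC"
```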

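If you don't have such a vector, you can probably build one with a clever application of adist, or even just table if the correct spelling is repeated more often than the others, which it likely will be (though it isn't here).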
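
One way the adist/table idea might look, as a rough sketch rather than a definitive recipe: cluster the distinct spellings on their pairwise edit distances, then let the most frequent spelling in each cluster stand in for the rest. The cut height of 5 edits is an assumption that happens to suit this example and would need tuning on real data. In this data every spelling occurs once, so the tie goes to whichever spelling appears first.

```r
company <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
             "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

# Distinct spellings and how often each occurs (counts in the order of u)
u <- unique(company)
freq <- as.vector(table(factor(company, levels = u)))

# Pairwise edit distances between the distinct spellings
d <- adist(u)

# Hierarchical clustering, cut at an assumed height of 5 edits
groups <- cutree(hclust(as.dist(d)), h = 5)

# Within each group, keep the most frequent spelling (first one on ties)
best <- sapply(split(seq_along(u), groups),
               function(i) u[i][which.max(freq[i])])

# Map every original value to its group's representative
cleaned <- unname(best[as.character(groups[match(company, u)])])
cleaned
```

Since the choice of representative is arbitrary on ties, it is worth printing `best` and correcting it by hand before overwriting the column.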

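So, after some time I ended up with this dumb code. Note that it does not fully automate the replacement: every match should be verified by a human, and the agrep max.distance argument needs fine-tuning each time. I'm sure there are ways to make it better and faster, but it gets the job done.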

    ##########
    # Manual renaming with partial matches
    ##########

    # a) Take a look at the desired column of factor variables
    sort(unique(MYDATA$names))   # take a look

    # ****
    Sensthreshold <- 0.2   # agrep max.distance; 0.1-0.2 usually gets it right
    Searchstring <- "Invesstment LLC"   # string to search for; matches are later overwritten with it
    # ****

    # User-defined function: returns similar string on query in column
    Searcher <- function(input, similarity = 0.1) {
      unique(agrep(input, 
                   MYDATA$names,   # <-- define your column here
                   ignore.case = TRUE, value = TRUE,
                   max.distance = similarity))
    }

    # b) Make a search of desired string
    Searcher(Searchstring, Sensthreshold)   # using user-def function 
    ### PLEASE INSPECT THE OUTPUT OF THE SEARCH
    ### Did it get it right?

    #=============================================================================#
    ## ACTION! This changes your dataframe!
    ## Please make backup before proceeding
    ## Please execute this code as a whole to avoid errors

    # c) Make a vector of cells indexes after checking output
    vector_of_cells <- agrep(Searchstring, 
                       MYDATA$names, ignore.case = TRUE,
                       max.distance = Sensthreshold)
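    # d) Apply the changes
    #    (convert to character first: assigning a value that is not an
    #    existing factor level would produce NA)
    MYDATA$names <- as.character(MYDATA$names)
    MYDATA$names[vector_of_cells] <- Searchstring # <--- CHANGING STRING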
    # e) Check result
    unique(agrep(Searchstring, MYDATA$names, 
                 ignore.case = TRUE, value = TRUE, max.distance = Sensthreshold))
    #=============================================================================#
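
If editing the data frame in place feels risky, the same search-then-replace steps can be wrapped in a small function that returns a corrected copy and leaves the original alone. This is only a sketch: `replace_fuzzy` and `companies` are names invented for the illustration, and it assumes a character vector rather than a factor.

```r
# Hypothetical helper: fuzzy-match `target` in `x` and return a corrected copy
replace_fuzzy <- function(x, target, max.distance = 0.1) {
  # Indexes of the elements that approximately match the target
  hits <- agrep(target, x, ignore.case = TRUE, max.distance = max.distance)
  out <- x
  out[hits] <- target
  # `matched` is for human inspection, `result` is the corrected vector
  list(matched = unique(x[hits]), result = out)
}

companies <- c("Investment LLC", "Hyperloop LLC", "Invezzstment LLC",
               "Investment_LLC", "Haiperloop LLC", "Inwestment LLC")

res <- replace_fuzzy(companies, "Investment LLC", max.distance = 0.2)
res$matched   # inspect these before accepting the change
```

`companies` is untouched here; assign `res$result` back to the column only once the matches look right.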