Efficiently ranking string matches by number of matched terms
Summary: what is the most efficient way to count matches against multiple regular expressions and rank the results by the number of occurrences? And should a semantic approach be used instead of regex?
Sample data for illustration:
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)
And a sample string query containing multiple terms:
sample_query <- c("after tax income")
Matches against the string query can easily be checked using grepl:
sample_string[grepl(sample_query, sample_string)]
But obviously that doesn't work here, since there is no exact match: the actual term is after-tax income. An alternative is to split the search query into its parts and check each of them:
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")), collapse = "|"), sample_string)]
This works, but returns too many results, because it matches any occurrence of any of the terms:
[1] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[2] "Number of employment income recipients aged 15 years and over in private households"
[3] "Number of market income recipients aged 15 years and over in private households"
[4] "Employment income (%)"
[5] "Without employment income"
[6] "With after-tax income"
[7] "Spending 30% or more of income on shelter costs"
Question: how can I efficiently return the closest matches, ranked by the number of individual term matches?
Applying some of the answers from here, and adding ordering and matching, leads to this monstrosity:
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)][order(-lengths(regmatches(
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)],
gregexpr(paste(unlist(
strsplit(sample_query, " +")
),
collapse = "|"),
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)])
)))]
This returns what I want: a list of all strings with at least one match, sorted by the number of matches.
[1] "With after-tax income"
[2] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[3] "Number of employment income recipients aged 15 years and over in private households"
[4] "Number of market income recipients aged 15 years and over in private households"
[5] "Employment income (%)"
[6] "Without employment income"
[7] "Spending 30% or more of income on shelter costs"
Cleaning up the monster above a bit:
to_match <- paste(unlist(strsplit(sample_query, " +")), collapse = "|")
results <- sample_string[grepl(to_match, sample_string)]
results[order(-lengths(regmatches(results, gregexpr(to_match, results))))]
I can live with this, but is there a way to make it more concise? And I also wonder whether this is the best way to approach the problem in the first place.
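One way to make this more concise is to wrap the counting and ranking in a small base-R helper. A minimal sketch (the rank_matches name is my own, not an established function):
rank_matches <- function(query, strings) {
  # Build an alternation pattern from the whitespace-separated query terms
  pattern <- paste(strsplit(query, " +")[[1]], collapse = "|")
  # Count matches per string; gregexpr() returns -1 where nothing matches
  counts <- vapply(gregexpr(pattern, strings), function(m) sum(m > 0), integer(1))
  # Keep strings with at least one match, sorted by descending match count
  strings[counts > 0][order(-counts[counts > 0])]
}
rank_matches(sample_query, sample_string)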
I'm aware of stringr::str_count and stringi::stri_count_regex. This is for a package, and I'm trying to avoid adding extra dependencies, but I can switch to them if they are more efficient.
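For comparison, the counting step with those packages would look roughly like this, reusing to_match and results from the cleaned-up version above:
# stringr: vectorised count of regex matches per string
results[order(-stringr::str_count(results, to_match))]
# stringi equivalent
results[order(-stringi::stri_count_regex(results, to_match))]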
Alternatively, would a string-distance approach be a better option, particularly when checking thousands of long strings? The purpose is to help users find relevant information, so maybe something more semantically oriented would make sense.
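As one base-R string-distance option, utils::adist() with partial = TRUE measures the edit distance of the query to the closest substring of each string, which would tolerate the after tax vs. after-tax mismatch. A minimal sketch:
# Partial (substring) edit distance of the query to each sample string
d <- adist(sample_query, sample_string, partial = TRUE)[1, ]
# The three closest strings
sample_string[order(d)][1:3]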
I'm sure this can be improved, but here is one approach using Levenshtein distance:
# Desired query scalar: actual_query => character vector
actual_query <- "after tax income"
# Separate words in query: query_words => character vector:
query_words <- unlist(strsplit(tolower(actual_query), "[^a-z]+"))
# Calculate n (scalar) for n-grams: word_count => integer scalar
word_count <- length(query_words)
# Split each sentence into words (splitting on whitespace keeps punctuation):
# sentence_word_split => list of character vectors
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")
# Split original sentences into n-grams (relative to query length):
# n_grams => list
n_grams <- lapply(sentence_word_split, function(x) {
  sapply(seq_along(x), function(i) {
    paste(x[i:min(length(x), i + word_count - 1)], collapse = " ")
  })
})
# Rank n-grams by the frequency of their occurrence in sample_string:
# ordered_ngram_count => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")
# Combine the query with each of its elements: revised_query => character vector
revised_query <- c(actual_query, unlist(strsplit(actual_query, "\\s+")))
# Use Levenshtein distance to determine similarity of revised_query
# to the expressions in ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(revised_query),
                                          function(i) {
                                            adist(revised_query[i], ordered_ngram_count)
                                          })),
                        gsub("\\s+", "_", revised_query))
# Example: return the sample_string elements containing, for each query
# column, the n-gram with the minimum edit distance (seq_len(), not
# seq_along(), since ncol() returns a single number):
best_ngrams <- vapply(seq_len(ncol(lev_dist_df)),
                      function(i) ordered_ngram_count[which.min(lev_dist_df[, i])],
                      character(1))
grep(paste(best_ngrams, collapse = "|"), sample_string, value = TRUE)
A cleaner version of the above:
# Desired query scalar: sample_query => character vector
sample_query <- "after tax income"
# Separate words in query: query_words => character vector:
query_words <- unlist(strsplit(tolower(sample_query), "[^a-z]+"))
# Calculate n (scalar) for n-grams: word_count => integer scalar
word_count <- length(query_words)
# Split each sentence into words (splitting on whitespace keeps punctuation):
# sentence_word_split => list of character vectors
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")
# Split original sentences into n-grams (relative to query length):
# n_grams => list
n_grams <- lapply(sentence_word_split, function(x) {
  sapply(seq_along(x), function(i) {
    paste(x[i:min(length(x), i + word_count - 1)], collapse = " ")
  })
})
# Rank n-grams by the frequency of their occurrence in sample_string:
# ordered_ngram_count => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")
# Use Levenshtein distance to determine similarity of the query
# to the expressions in ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(sample_query),
                                          function(i) {
                                            adist(sample_query[i], ordered_ngram_count)
                                          })),
                        gsub("\\s+", "_", sample_query))
# Example: return the sample_string element(s) containing the n-gram with
# the minimum edit distance to the query:
grep(ordered_ngram_count[which.min(lev_dist_df[, 1])], sample_string,
     value = TRUE)
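To return a ranked shortlist rather than a single best hit, the same distance column can order every candidate n-gram. A small sketch extending the above (the ranked variable is my addition, not part of the original approach):
# All candidate n-grams, ordered by edit distance to the query (closest first)
ranked <- ordered_ngram_count[order(lev_dist_df[, 1])]
head(ranked)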
Data:
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)