Compare two strings in R and see additions, deletions

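I want to compare two character values in R and see which characters were added and deleted so I can display them later, similar to git diff --color-words=. (see the screenshot below).

For example: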

a <- "hello world"
b <- "helo world!"

diff <- FUN(a, b)

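where diff would somehow show that an l was deleted and a ! was added.

The end goal is to construct an HTML string like this: hel<span class="deleted">l</span>o world<span class="added">!</span>

I am aware of diffobj, but so far I could not get it to return character-level differences, only differences between elements.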

[Screenshot: output of git diff --color-words=.]

I found a solution that uses diffobj::ses_dat() after first splitting the strings into individual characters.

get_html_diff <- function(a, b) {
  aa <- strsplit(a, "")[[1]]
  bb <- strsplit(b, "")[[1]]
  s <- diffobj::ses_dat(aa, bb)
  
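  # Number consecutive runs of the same operation so each run gets one <span>;
  # comparing shifted copies also works when there is only a single operation
  op <- as.integer(s$op)
  m <- cumsum(c(TRUE, op[-1] != op[-length(op)]))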
  
  res <- paste(
    sapply(split(seq_along(s$op), m), function(i) {
      val <- paste(s$val[i], collapse = "")
      if (s$op[i[[1]]] == "Insert")
        val <- paste0("<span class=\"add\">", val, "</span>")
      if (s$op[i[[1]]] == "Delete")
        val <- paste0("<span class=\"del\">", val, "</span>")
      val
    }),
    collapse = "")
  res
}

get_html_diff("hello world", "helo World!")
#> [1] "hel<span class=\"del\">l</span>o <span class=\"del\">w</span><span class=\"add\">W</span>orld<span class=\"add\">!</span>"

Created on 2022-05-31 by the reprex package (v2.0.1)

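Base R has a function, adist, that computes the generalized Levenshtein distance. With the arguments counts = TRUE and partial = FALSE, the attribute "trafos" is set to the sequence of matches, insertions and deletions needed to get from one string to the other. From the Value section of the documentation (the relevant part is the sentence about "trafos"):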

If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial = FALSE, the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements ‘M’, ‘I’, ‘D’ and ‘S’ indicating a match, insertion, deletion and substitution, respectively. If partial = TRUE, the offsets (positions of the first and last element) of the matched substrings are returned as the "offsets" attribute of the return value (with both offsets -1 in case of no match).

a <- "hello world"
b <- "helo world!"
attr(adist(a, b, counts = TRUE), "trafos")
#>      [,1]          
#> [1,] "MMDMMMMMMMMI"

Created on 2022-05-31 by the reprex package (v2.0.1)

There is a deletion at the 3rd character and an insertion at the end of the string.
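To get from the "trafos" string to the HTML asked for in the question, one possible approach is to walk the operations while keeping one index into each string. The following is only a minimal sketch using base R: the function name trafos_to_html and the helper wrap are made up for illustration, the class names "deleted" and "added" are taken from the question, a substitution (S) is rendered as a deletion followed by an addition, and no HTML escaping of the characters is done.

```r
# Hypothetical helper (not part of any package): build an HTML diff
# from the "trafos" attribute returned by adist().
trafos_to_html <- function(a, b) {
  trafos <- attr(adist(a, b, counts = TRUE), "trafos")[1, 1]
  ops <- strsplit(trafos, "")[[1]]
  aa <- strsplit(a, "")[[1]]
  bb <- strsplit(b, "")[[1]]

  # Wrap a single character in a span with the given class
  wrap <- function(cls, ch) paste0("<span class=\"", cls, "\">", ch, "</span>")

  i <- 0L  # position in a
  j <- 0L  # position in b
  out <- character(0)
  for (op in ops) {
    if (op == "M") {            # match: present in both strings
      i <- i + 1L; j <- j + 1L
      out <- c(out, aa[i])
    } else if (op == "D") {     # deletion: only in a
      i <- i + 1L
      out <- c(out, wrap("deleted", aa[i]))
    } else if (op == "I") {     # insertion: only in b
      j <- j + 1L
      out <- c(out, wrap("added", bb[j]))
    } else if (op == "S") {     # substitution: show old, then new
      i <- i + 1L; j <- j + 1L
      out <- c(out, wrap("deleted", aa[i]), wrap("added", bb[j]))
    }
  }
  paste(out, collapse = "")
}

html <- trafos_to_html("hello world", "helo world!")
```

Each changed character gets its own span here; adjacent spans of the same class could be merged afterwards if cleaner markup matters.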

We use diffobj to compare configuration files (in a more or less production environment) and it works well. In your case, isn't diffobj::diffChr what you want?

diffobj::diffChr("hello world", "helo world!", color.mode = 'rgb')