Problem with adist function in text comparison
I have a question about the adist function. I am basically using the example from RDocumentation:
attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos")
However, when I add one more word and run
attr(adist(c("kitten", "sitting", "hi"), counts = TRUE), "trafos")
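I get these results:
     [,1]      [,2]      [,3]
[1,] "MMMMMM"  "SMMMSMI" "SMDDDDI"
[2,] "SMMMSMD" "MMMMMMM" "SDDDMDD"
[3,] "SMIIIID" "SIIIMII" "MMI"
In the third row, third column I get "MMI", but I don't understand why, since both strings are the same word "hi". It should be "MM" (match, match, and no insertion).
Reference: https://www.rdocumentation.org/packages/utils/versions/3.6.0/topics/adist
I tried another example: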
test <- c('x','hi', 'y','x')
attr(adist(test, y=NULL , counts = TRUE), "trafos")
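I get the results below. But at least the diagonal should contain only M, since each word is compared with itself.
     [,1] [,2] [,3] [,4]
[1,] "M"  "SI" "SI" "MI"
[2,] "SD" "MM" "SD" "SD"
[3,] "SD" "SI" "MI" "SI"
[4,] "MI" "SI" "SI" "MI"
I don't understand what is going on.
As others have already pointed out, this looks like a bug. Using the source code from https://cran.r-project.org/src/base/R-3/R-3.5.3.tar.gz and looking at lines 429-432 of the file src/main/agrep.c, there is code that reverses the buffer: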
/* Now reverse the transcript. */
for(k = 0, l = --m; l >= nz; k++, l--)
buf[k] = buf[l];
buf[++k] = '\0';
Stepping through what happens in gdb:
$ R -d gdb
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
...
(gdb) b agrep.c:430
Breakpoint 1 at 0x7222e: file agrep.c, line 430.
(gdb) r
Starting program: /usr/local/lib64/R/bin/exec/R
...
R version 3.5.3 (2019-03-11) -- "Great Truth"
...
Then execute the following R code:
> attr(adist(c("kitten", "sitting", "hi"), counts = TRUE), "trafos")
Breakpoint 1, adist_full (x=0x555557995a48, y=0x555557995a48, costs=0x5555561567a8, opt_counts=TRUE) at agrep.c:430
430 for(k = 0, l = --m; l >= nz; k++, l--)
At the breakpoint, continue with a count of 8 to reach the last diagonal entry:
(gdb) c 8
Will ignore next 7 crossings of breakpoint 1. Continuing.
Breakpoint 1, adist_full (x=0x555557995a48, y=0x555557995a48, costs=0x5555561567a8, opt_counts=TRUE) at agrep.c:430
430 for(k = 0, l = --m; l >= nz; k++, l--)
Inspect the buffer before the reversal:
(gdb) x/6c buf
0x5555566a8da0: 83 'S' 73 'I' 73 'I' 73 'I' 77 'M' 77 'M'
Single-stepping through the code shows buf[0] and buf[1] being copied from the end of the buffer:
(gdb) n
431 buf[k] = buf[l];
(gdb) n
430 for(k = 0, l = --m; l >= nz; k++, l--)
(gdb) p k
$1 = 0
(gdb) n
431 buf[k] = buf[l];
(gdb) n
430 for(k = 0, l = --m; l >= nz; k++, l--)
(gdb) p k
$2 = 1
The loop exits with k=2:
(gdb) n
432 buf[++k] = '\0';
(gdb) p k
$3 = 2
The ++k then makes it 3:
(gdb) n
433 COUNTS(i, j, 0) = nins;
(gdb) p k
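$4 = 3
Inspecting the reversed buffer shows that buf[2] was not set to NUL and still holds a stale 'I', because the terminator was written to buf[3]:
(gdb) x/6c buf
0x5555566a8da0: 77 'M' 77 'M' 73 'I' 0 '\000' 77 'M' 77 'M'
This results in:
     [,1]      [,2]      [,3]
[1,] "MMMMMM"  "SMMMSMI" "SMDDDDI"
[2,] "SMMMSMD" "MMMMMMM" "SDDDMDD"
[3,] "SMIIIID" "SIIIMII" "MMI"
Replacing buf[++k] = '\0' with buf[k] = '\0' appears to put the NUL in the right place: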
> attr(adist(c("kitten", "sitting", "hi"), counts = TRUE), "trafos")
Breakpoint 1, adist_full (x=0x555557995cb8, y=0x555557995cb8, costs=0x5555561567a8, opt_counts=TRUE) at agrep.c:430
430 for(k = 0, l = --m; l >= nz; k++, l--)
(gdb) c 8
Will ignore next 7 crossings of breakpoint 1. Continuing.
Breakpoint 1, adist_full (x=0x555557995cb8, y=0x555557995cb8, costs=0x5555561567a8, opt_counts=TRUE) at agrep.c:430
430 for(k = 0, l = --m; l >= nz; k++, l--)
(gdb) x/6c buf
0x5555566a8da0: 83 'S' 73 'I' 73 'I' 73 'I' 77 'M' 77 'M'
(gdb) n
431 buf[k] = buf[l];
(gdb) n
430 for(k = 0, l = --m; l >= nz; k++, l--)
(gdb) p k
$1 = 0
(gdb) n
431 buf[k] = buf[l];
(gdb) n
430 for(k = 0, l = --m; l >= nz; k++, l--)
(gdb) p k
$2 = 1
(gdb) n
432 buf[k] = '\0';
(gdb) p k
$3 = 2
(gdb) n
433 COUNTS(i, j, 0) = nins;
(gdb) p k
$4 = 2
(gdb) x/6c buf
0x5555566a8da0: 77 'M' 77 'M' 0 '\000' 73 'I' 77 'M' 77 'M'
This produces the expected output:
     [,1]      [,2]      [,3]
[1,] "MMMMMM"  "SMMMSMI" "SMDDDD"
[2,] "SMMMSMD" "MMMMMMM" "SDDDMDD"
[3,] "SMIIII"  "SIIIMII" "MM"
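The off-by-one can also be reproduced outside of R. The following is a minimal standalone C sketch, not the R source: the loop is copied from the agrep.c excerpt, while the function names reverse_transcript and demo, the fixed switch, and the values nz = 4 and m = 6 are assumptions inferred from the gdb session (a stale "SIII" followed by the freshly written "MM").

```c
#include <string.h>

/* Reverse buf[nz..m-1] into the front of buf and terminate it.
 * The loop mirrors the one quoted from agrep.c; `fixed` is a
 * hypothetical switch that selects which terminator statement runs. */
void reverse_transcript(char *buf, int nz, int m, int fixed)
{
    int k, l;
    for (k = 0, l = --m; l >= nz; k++, l--)
        buf[k] = buf[l];
    /* When the loop exits, k already equals the number of copied chars. */
    if (fixed)
        buf[k] = '\0';      /* terminates directly after the copied chars */
    else
        buf[++k] = '\0';    /* skips one slot, exposing a stale byte */
}

/* Fill `out` (at least 8 bytes) with the buffer contents seen in gdb,
 * run one variant of the reversal on it, and return `out`. */
const char *demo(char *out, int fixed)
{
    memset(out, 0, 8);
    memcpy(out, "SIIIMM", 6);   /* assumed layout: stale "SIII", new "MM" */
    reverse_transcript(out, 4, 6, fixed);
    return out;
}
```

Calling demo(buf, 0) on an 8-byte buffer leaves "MMI" in it, and demo(buf, 1) leaves "MM", which matches the two buffer dumps in the gdb sessions.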
With the fix, your second example gives:
> test <- c('x','hi', 'y','x')
> attr(adist(test, y=NULL , counts = TRUE), "trafos")
     [,1] [,2] [,3] [,4]
[1,] "M"  "SI" "S"  "M"
[2,] "SD" "MM" "SD" "SD"
[3,] "S"  "SI" "M"  "S"
[4,] "M"  "SI" "S"  "M"
The results appear to be consistent with the ins, del and sub counts:
> adist(c('x', 'hi', 'y', 'x'), counts=TRUE)
     [,1] [,2] [,3] [,4]
[1,]    0    2    1    0
[2,]    2    0    2    2
[3,]    1    2    0    1
[4,]    0    2    1    0
attr(,"counts")
, , ins
     [,1] [,2] [,3] [,4]
[1,]    0    1    0    0
[2,]    0    0    0    0
[3,]    0    1    0    0
[4,]    0    1    0    0
, , del
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    1    0    1    1
[3,]    0    0    0    0
[4,]    0    0    0    0
, , sub
     [,1] [,2] [,3] [,4]
[1,]    0    1    1    0
[2,]    1    0    1    1
[3,]    1    1    0    1
[4,]    0    1    1    0
attr(,"trafos")
     [,1] [,2] [,3] [,4]
[1,] "M"  "SI" "S"  "M"
[2,] "SD" "MM" "SD" "SD"
[3,] "S"  "SI" "M"  "S"
[4,] "M"  "SI" "S"  "M"
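One way to make this consistency check mechanical is to tally the letters of each transcript and compare them with the corresponding entries of the counts array. The sketch below is an illustration only; struct op_counts and tally_transcript are hypothetical names and not part of R.

```c
/* Hypothetical container for the number of each edit operation. */
struct op_counts { int ins, del, sub, match; };

/* Count the I (insert), D (delete), S (substitute) and M (match)
 * letters in a NUL-terminated transcript string such as "SMDDDD". */
struct op_counts tally_transcript(const char *trafo)
{
    struct op_counts c = {0, 0, 0, 0};
    for (; *trafo != '\0'; trafo++) {
        switch (*trafo) {
        case 'I': c.ins++;   break;
        case 'D': c.del++;   break;
        case 'S': c.sub++;   break;
        case 'M': c.match++; break;
        default:  break;     /* ignore any other character */
        }
    }
    return c;
}
```

For entry [1,3] ('x' versus 'y') the counts report 0 insertions and 1 substitution. Tallying the fixed transcript "S" agrees with that, whereas the pre-fix transcript "SI" would report one spurious insertion.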