Kolmogorov-Smirnov test: C to R translation issue
I'm having trouble translating an algorithm from C to R. It concerns the Kolmogorov-Smirnov test, more specifically the KS probability function,
which is coded in 'Numerical Recipes in C' as 'probks':
#include <math.h>
#define EPS1 0.001
#define EPS2 1.0e-8
float probks(float alam)
/*Kolmogorov-Smirnov probability function.*/
{
    int j;
    float a2,fac=2.0,sum=0.0,term,termbf=0.0;
    a2 = -2.0*alam*alam;
    for (j=1;j<=100;j++) {
        term=fac*exp(a2*j*j);
        sum += term;
        if (fabs(term) <= EPS1*termbf || fabs(term) <= EPS2*sum) return sum;
        fac = -fac; /*Alternating signs in sum.*/
        termbf=fabs(term);
    }
    return 1.0; /* Get here only by failing to converge. */
}
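For reference, reading the loop directly (term = fac*exp(a2*j*j), with fac alternating between 2 and -2 and a2 = -2*alam*alam), the series being summed appears to be the following; the symbol $\lambda$ stands for the argument alam:

$$Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2}$$

At $\lambda = 0$ every term is $\pm 2$, so the partial sums alternate between 2 and 0 and neither stopping test inside the loop is ever satisfied. In that case it is the final `return 1.0` that supplies $Q_{KS}(0) = 1$.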
I don't know how to handle the last few lines in the R translation. What I have so far is
PROBKS <- function(lambda) {
  EPS1 <- 0.001; EPS2 <- 1.0e-8;
  sum <- 0.0; fac <- 2.0; termbf <- 0.0;
  a2 <- -2*lambda*lambda
  for (j in 1:100) {
    term <- fac * exp(a2*j*j)
    sum <- sum + term
    if ( (abs(term) <= EPS1*termbf) || (abs(term) <= EPS2*sum) ) {
      break
    } else {
      fac <- -fac
    }
  }
  termbf <- abs(term)
  return(sum)
}
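But this produces a non-monotonic probability function.
It should satisfy $Q_{KS}(0) = 1$ and $Q_{KS}(\infty) = 0$.
Apparently the problem lies in how to interpret/encode the last 'if' statement.
Any help would be appreciated. M
Edit 1:
Here is my session info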
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.4.3 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.7
[5] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
[9] ggplot2_3.1.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] withr_2.1.2 rvest_0.3.2 tidyselect_0.2.5 lattice_0.20-35
[5] pkgconfig_2.0.2 xml2_1.2.0 compiler_3.4.4 readxl_1.1.0
[9] Rcpp_0.12.19 cli_1.0.1 plyr_1.8.4 cellranger_1.1.0
[13] httr_1.3.1 tools_3.4.4 nlme_3.1-131.1 broom_0.5.0
[17] R6_2.3.0 bindrcpp_0.2.2 bindr_0.1.1 scales_1.0.0
[21] assertthat_0.2.0 gtable_0.2.0 stringi_1.1.7 rstudioapi_0.8
[25] backports_1.1.2 hms_0.4.2 munsell_0.5.0 grid_3.4.4
[29] colorspace_1.3-2 glue_1.3.0 lubridate_1.7.4 rlang_0.3.0.1
[33] magrittr_1.5 lazyeval_0.2.1 yaml_2.2.0 crayon_1.3.4
[37] haven_1.1.2 modelr_0.1.2 pillar_1.3.0 jsonlite_1.5
Edit 2
Using Konrad's function ks_cdf and
x = seq(0, 1, by = 0.01)
plot(x, ks_cdf(x))
still gives 0 at 0
Edit 3
After upgrading to 3.6.1
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
...
I still get the same plot as above, i.e. ks_cdf(0)=0, whereas it should be ks_cdf(0)=1
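The code can be translated into R almost literally; it's unclear why you deviated from the C code for no reason. In particular, your version only updates termbf after the loop has finished, and it returns sum rather than 1 when the series fails to converge. Here's a literal, slightly cleaned-up translation: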
ks_cdf = function (lambda) {
    EPS1 = 0.001
    EPS2 = 1.0e-8
    sum = 0
    fac = 2
    termbf = 0
    a2 = -2 * lambda ^ 2
    for (j in 1 : 100) {
        term = fac * exp(a2 * j ^ 2)
        sum = sum + term
        if ((abs(term) <= EPS1 * termbf) || (abs(term) <= EPS2 * sum)) {
            return(sum)
        } else {
            fac = -fac
            termbf = abs(term)
        }
    }
    1 # Failed to converge.
}
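As a quick sanity check of the boundary behaviour asked about in the question, the scalar loop can be evaluated on a grid. This is only a sketch: the name `ks_loop` and the grid from 0 to 3 are my own choices for illustration, and the function body simply repeats the literal translation so the snippet runs on its own.

```r
# Literal loop translation, repeated so the snippet is self-contained.
# `ks_loop` is a hypothetical name used only for this check.
ks_loop = function (lambda) {
    eps1 = 0.001
    eps2 = 1e-8
    sum = 0
    fac = 2
    termbf = 0
    a2 = -2 * lambda ^ 2
    for (j in 1 : 100) {
        term = fac * exp(a2 * j ^ 2)
        sum = sum + term
        if (abs(term) <= eps1 * termbf || abs(term) <= eps2 * sum) return(sum)
        fac = -fac
        termbf = abs(term)
    }
    1  # Failed to converge.
}

# The loop only accepts a single value, so map it over the grid with vapply.
x = seq(0, 3, by = 0.05)
q = vapply(x, ks_loop, numeric(1))

q[1]            # value at lambda = 0 (the non-convergent case)
q[length(q)]    # value at lambda = 3
range(diff(q))  # successive differences along the grid
```

Because the series is truncated once the stopping rule fires, neighbouring values in the flat region near 0 may differ from 1 by a tiny amount in either direction, so any monotonicity check should allow a small tolerance.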
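This code works but isn't vectorised, which is something I would change for a real implementation (although in doing so we lose the early exit).
Here is an idiomatic R implementation using vectorised arithmetic and matrix multiplication: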
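ks_cdf = function (λ) {
    eps1 = 0.001
    eps2 = 1E-8
    range = seq(1, 100)
    # One row per series term, one column per value of λ.
    terms = (-1) ^ (range - 1) * exp(-2 * range ^ 2 %*% t(λ ^ 2))
    sums = 2 * colSums(terms)
    pterms = abs(terms)
    prev_pterms = rbind(0, pterms[-nrow(pterms), , drop = FALSE])
    # Repeat each column's sum down its rows so that it lines up with pterms.
    col_sums = rep(sums, each = nrow(pterms))
    converged = apply(pterms <= eps1 * prev_pterms | pterms <= eps2 * col_sums, 2L, any)
    sums[! converged] = 1
    sums
}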
And to show how nicely it vectorises, which is actually a big deal:
x = seq(0, 1, by = 0.01)
plot(x, ks_cdf(x))
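If the result should match the loop more closely, one option is to apply the stopping rule to running partial sums rather than to the full 100-term sum. The following is a minimal sketch under that assumption; the name `ks_cdf_partial` and its default arguments are hypothetical, and it uses `outer` and `cumsum` instead of the matrix product above.

```r
# Sketch only: vectorised variant that mimics the loop's early exit.
# `ks_cdf_partial`, `n_terms`, `eps1` and `eps2` are illustrative names.
ks_cdf_partial = function (lambda, n_terms = 100L, eps1 = 0.001, eps2 = 1e-8) {
    j = seq_len(n_terms)
    # One row per series term (factor 2 included), one column per lambda.
    terms = 2 * (-1) ^ (j - 1) * exp(-2 * outer(j ^ 2, lambda ^ 2))
    # Running sums down each column: the value of `sum` at every iteration.
    partial = apply(terms, 2L, cumsum)
    aterms = abs(terms)
    # Previous term's magnitude: the value of `termbf` at every iteration.
    prev = rbind(0, aterms[-n_terms, , drop = FALSE])
    hit = aterms <= eps1 * prev | aterms <= eps2 * partial
    # Row at which the stopping rule first fires; NA if it never does.
    first = apply(hit, 2L, function (col) which(col)[1L])
    result = partial[cbind(first, seq_along(lambda))]
    result[is.na(first)] = 1  # Failed to converge.
    result
}

x = seq(0, 1, by = 0.01)
y = ks_cdf_partial(x)
```

The trade-off is the same as before: every column computes all `n_terms` terms, so the early exit is only emulated when selecting the result, not when doing the arithmetic.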