R, bit64: problems calculating row mean and standard deviation in data.table

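I am trying to work with larger numbers, above 2^32. I am also using data.table and fread, but I do not think the problem is related to them: I can switch the symptoms on and off without changing the data.table or the use of fread. The symptom is that I get reported means like 4.1e-302 when I expect values with positive exponents, 1e+3 to 1e+17.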

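The problem shows up consistently when I use the bit64 package and the integer64-related functions. Things work for me with "regular sized data and R", but I am not expressing things correctly with this package. Please see my code and data below.

I am on a MacBook Pro, 16GB, i7 (updated).

I have restarted my R session and cleared the workspace, but the problem persists.

Please advise; I would greatly appreciate your input. I think it must be related to using the bit64 library.

Links I have looked at include the bit64 doc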

An issue that had similar symptoms caused by an fread() memory leak, but I think I eliminated

Here is my input data

var1,var2,var3,var4,var5,var6,expected_row_mean,expected_row_stddev
1000 ,993 ,987 ,1005 ,986 ,1003 ,996 ,8 
100000 ,101040 ,97901 ,100318 ,96914 ,97451 ,98937 ,1722 
10000000 ,9972997 ,9602778 ,9160554 ,8843583 ,8688500 ,9378069 ,565637 
1000000000 ,1013849241 ,973896894 ,990440721 ,1030267777 ,1032689982 ,1006857436 ,23096234 
100000000000 ,103171209097 ,103660949260 ,102360301140 ,103662297222 ,106399064194 ,103208970152 ,2078732545 
10000000000000 ,9557954451905 ,9241065464713 ,9357562691674 ,9376495364909 ,9014072235909 ,9424525034852 ,334034298683 
1000000000000000 ,985333546044881 ,994067361457872 ,1034392968759970 ,1057553099903410 ,1018695335152490 ,1015007051886440 ,27363415718203 
100000000000000000 ,98733768902499600 ,103316759127969000 ,108062824583319000 ,111332326225036000 ,108671041505404000 ,105019453390705000 ,5100048567944390 

My code, using this sample data

# file: problem_bit64.R
# OBJECTIVE: Using larger numbers, I want to calculate a row mean and row standard deviation
# ERROR:  I don't know what I am doing wrong to get such errors, seems bit64 related
# PRIORITY: BLOCKED (do this in Python instead?)
# reported Sat 9/24/2016 by Greg

# sample data:
# each row is 100 times larger on average, for 8 rows, starting with 1,000
# for the vars within a row, there is 10% uniform random variation.  B2 = ROUND(A2+A2*0.1*(RAND()-0.5),0)    

# Install development version of data.table --> for fwrite()
install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source")
require(data.table)
require(bit64)
.Machine$integer.max   # 2147483647     Is this an issue ?
.Machine$double.xmax   # 1.797693e+308  I assume not

# -------------------------------------------------------------------
# ---- read in and basic info that works
csv_in <- "problem_bit64.csv"
dt <- fread( csv_in )
dim(dt)                # 8 8  (8 rows, 8 columns in the sample data)
lapply(dt, class)      # "integer64" for all 8
names(dt)  # "var1" "var2"  "var3"  "var4"  "var5" "var6" "expected_row_mean" "expected_row_stddev"
dtin <- dt[, 1:6, with=FALSE]  # just save the 6 input columns

...now the problems start

# -------------------------------------------------------------------
# ---- CALCULATION PROBLEMS START HERE
# ---- for each row, I want to calculate the mean and standard deviation
a <- apply(dtin, 1, mean.integer64); a   # get 8 values like 4.9e-321
b <- apply(dtin, 2, mean.integer64); b   # get 6 values like 8.0e-308

# ---- try secondary variations that do not work
c <- apply(dtin, 1, mean); c             # get 8 values like 4.9e-321
c <- apply(dtin, 1, mean.integer64); c   # same result
c <- apply(dtin, 1, function(x) mean(x));   c          # same
c <- apply(dtin, 1, function(x) sum(x)/length(x));  c  # same results as mean(x)

##### I don't see any sd.integer64       # FEATURE REQUEST, Z-TRANSFORM IS COMMON
c <- apply(dtin, 1, function(x) sd(x));   c          # unrealistic values - see expected

Regular-sized R on regular data, still using data read in via fread() into a data.table - WORKS

# -------------------------------------------------------------------
# ---- delete big numbers, and try regular stuff - WHICH WORKS
dtin2 <- dtin[ 1:3, ]    # just up to about 10 million (SAME DATA, SAME FREAD, SAME DATA.TABLE)
dtin2[ , var1 := as.integer(var1) ]  # I know there are fancier ways to do this
dtin2[ , var2 := as.integer(var2) ]  # but I want things to work before getting fancy.
dtin2[ , var3 := as.integer(var3) ]
dtin2[ , var4 := as.integer(var4) ]
dtin2[ , var5 := as.integer(var5) ]
dtin2[ , var6 := as.integer(var6) ]
lapply( dtin2, class )   # validation

c <- apply(dtin2, 1, mean); c   # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) mean(x));   c          # CORRECT
c <- apply(dtin2, 1, function(x) sum(x)/length(x));  c  # same results as mean(x)

c <- apply(dtin2, 1, sd); c             # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) sd(x));   c          # CORRECT

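As a short and foremost recommendation for most readers: use 'double' instead of 'integer64' unless you have a specific reason to use 64-bit integers. 'double' is an R internal data type, whereas 'integer64' is a package extension data type, represented as a 'double' vector with the class attribute 'integer64'. That is, the 64 bits of each element are interpreted as a 64-bit integer only by code that knows about this class. Unfortunately, many core R functions do not know about 'integer64', which easily leads to wrong results. Hence coercing to 'double'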

dtind <- dtin
for (i in seq_along(dtind))
  dtind[[i]] <- as.double(dtind[[i]])
b <- apply(dtind, 1, mean)

gives somewhat expected results

> b
[1] 9.956667e+02 9.893733e+04 9.378069e+06 1.006857e+09 1.032090e+11 9.424525e+12 1.015007e+15 1.050195e+17

although not exactly what you expected, neither when looking at the rounded differences

> b - dt$expected_row_mean
integer64
[1] -1   0    -1   -1   0    -1   -3   -392

nor when looking at the unrounded differences

> b - as.double(dt$expected_row_mean)
[1]   -0.3333333    0.3333333   -0.3333333   -0.1666666    0.1666718 -0.3339844   -2.8750000 -384.0000000
Warning message:
In as.double.integer64(dt$expected_row_mean) :
  integer precision lost while converting to double
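
A minimal sketch of where values like 4.9e-321 in the question come from, assuming the bit64 package is installed: stripping the class attribute exposes the underlying 'double' storage, and the same 64 bits are then read as a tiny denormalized floating-point number rather than as an integer.

```r
library(bit64)

x <- as.integer64(1000)

typeof(x)   # the storage type is "double"; only the class says integer64
class(x)

# Dropping the class makes R read the same 64 bits as a double,
# which for the integer 1000 is a denormalized number of roughly 4.9e-321.
raw_bits_as_double <- unclass(x)
raw_bits_as_double

# Explicit coercion converts the value instead of reinterpreting the bits.
as.double(x)
```

Any function that silently drops the class attribute ends up computing on numbers like `raw_bits_as_double`.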

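OK, let us assume you really want integer64 because your largest numbers are beyond the integer precision of doubles, 2^53. Then your problem starts with 'apply' not knowing about integer64 and actually destroying the 'integer64' class attribute: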

> apply(dtin, 1, is.integer64)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

It actually destroys the 'integer64' class attribute twice, once when preparing the inputs and once when postprocessing the outputs. We can fix this by

c <- apply(dtin, 1, function(x){
  oldClass(x) <- "integer64"  # fix 
  mean(x) # note that this dispatches to mean.integer64
})
oldClass(c) <- "integer64"  # fix again

Now the results look reasonable

> c
integer64
[1] 995                98937              9378068            1006857435         103208970152       9424525034851      1015007051886437   105019453390704600

but still not what you expected

> c - dt$expected_row_mean
integer64
[1] -1   0    -1   -1   0    -1   -3   -400

The small differences (-1) are due to rounding: the floating-point mean is

> b[1]
[1] 995.6667

where you assume

> dt$expected_row_mean[1]
integer64
[1] 996

while mean.integer64 coerces (truncates) to integer64. This behaviour of mean.integer64 is debatable, but it is at least consistent:

x <- seq(0, 1, 0.25)
> data.frame(x=x, y=as.integer64(0) + x)
     x y
1 0.00 0
2 0.25 0
3 0.50 0
4 0.75 0
5 1.00 1
> mean(as.integer64(0:1))
integer64
[1] 0

The topic of rounding makes clear that implementing sd.integer64 would be even more debatable. Should it return integer64 or double?
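
If a rounded rather than truncated integer mean is wanted, one option is to compute the sum exactly in integer64 and do the division in 'double'. This is only a sketch, not part of bit64: the helper name `rounded_mean64` is hypothetical, and the conversion of the sum to double is exact only while the sum stays below 2^53.

```r
library(bit64)

# Hypothetical helper: exact integer64 sum, then divide and round in double.
# Assumes the sum lies in the range where double is exact (below 2^53).
rounded_mean64 <- function(x) {
  total <- sum(x)  # dispatches to sum.integer64
  as.integer64(round(as.double(total) / length(x)))
}

# First row of the sample data: 5974 / 6 = 995.67, which rounds to 996.
row1 <- as.integer64(c(1000, 993, 987, 1005, 986, 1003))
rounded_mean64(row1)
```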

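Regarding the larger differences, it is unclear what the rationale for your expectation is: taking the seventh row of your table and subtracting its minimum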

x <- (unlist(dtin[7,]))
oldClass(x) <- "integer64"
y <- min(x)
z <- as.double(x - y)

gives numbers in a range where 'double' handles integers precisely

> log2(z)
[1] 43.73759     -Inf 42.98975 45.47960 46.03745 44.92326
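
For reference, the limit of that range can be checked in base R without any packages. Neighbouring integers are distinguishable in a 'double' up to 2^53; above that they start to collapse onto the same stored value.

```r
# Below 2^53 neighbouring integers are still distinct doubles.
(2^53 - 1) == (2^53 - 2)   # FALSE

# At 2^53 adding 1 no longer changes the stored value.
(2^53 + 1) == 2^53         # TRUE
```

Offsets with log2 values around 43 to 46, as printed above, are therefore well inside the exact range.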

Taking the mean and comparing it with your expectation still gives a difference that cannot be explained by rounding

> mean(z) - as.double(dt$expected_row_mean[7] - y)
[1] -2.832031
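
The subtract-the-minimum idea can be turned into a small sketch for a row mean and standard deviation that does not rely on 'apply'. The function name `shifted_stats64` is hypothetical. Returning the mean as a rounded integer64 and the standard deviation as a double is just one possible design choice. The sketch assumes that the spread within a row (maximum minus minimum) stays below 2^53.

```r
library(bit64)

# Hypothetical helper: shift by the minimum in exact integer64 arithmetic,
# then compute the statistics on the small offsets in double.
shifted_stats64 <- function(x) {
  stopifnot(is.integer64(x))
  lo <- min(x)              # exact, stays integer64
  z  <- as.double(x - lo)   # small offsets, exactly representable as double
  list(
    mean = lo + as.integer64(round(mean(z))),  # shift the rounded mean back
    sd   = sd(z)                               # sd is unaffected by the shift
  )
}

# Seventh row of the sample data, entered as strings to avoid any
# precision loss before the values become integer64.
row7 <- as.integer64(c("1000000000000000", "985333546044881",
                       "994067361457872",  "1034392968759970",
                       "1057553099903410", "1018695335152490"))
res <- shifted_stats64(row7)
res$mean
res$sd
```

For a row taken from the data.table, the class has to be restored first, as in the `oldClass(x) <- "integer64"` step shown earlier.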