Error with large numerics in dcast.data.table
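Given a data frame, I am trying to convert it from long to wide format using the dcast.data.table function from library(data.table). However, when large numbers are used on the left-hand side of the formula, distinct values somehow get combined.
Below is an example: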
df <- structure(list(A = c(10000000007624, 10000000007619, 10000000007745,
10000000007624, 10000000007767, 10000000007729, 10000000007705,
10000000007711, 10000000007784, 10000000007745, 10000000007624,
10000000007762, 10000000007762, 10000000007631, 10000000007762,
10000000007619, 10000000007628, 10000000007705, 10000000007762,
10000000007624, 10000000007745, 10000000007706, 10000000007767,
10000000007777, 10000000007624, 10000000007745, 10000000007624,
10000000007777, 10000000007771, 10000000007631, 10000000007624,
10000000007640, 10000000007642, 10000000007708, 10000000007711,
10000000007745, 10000000007767, 10000000007655, 10000000007722,
10000000007745, 10000000007762, 10000000007771, 10000000007617
), B = c(4060697L, 7683673L, 7699192L, 1322422L, 7754939L, 7448486L,
2188027L, 1061376L, 2095950L, 7793530L, 2095950L, 6447861L, 2188027L,
7448451L, 7428427L, 7516354L, 7067801L, 2095950L, 6740142L, 405911L,
4057215L, 1061345L, 7754945L, 7501748L, 2188027L, 7780980L, 6651988L,
6649330L, 6655118L, 6556367L, 6463510L, 2347462L, 7675114L, 6556361L,
1061345L, 7224099L, 6463515L, 2188027L, 6463515L, 7311234L, 7764971L,
7224099L, 2347479L), C = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
3L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 25L, 2L, 1L, 2L,
1L, 1L, 1L)), .Names = c("A", "B", "C"), row.names = c(NA, -43L
), class = "data.frame")
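library(data.table)  # provides as.data.table() and dcast.data.table()
df <- as.data.table(df)
# Cast long to wide: one row per A, one column per B, summing C
output <- dcast.data.table(df, A ~ B, value.var = "C",
                           fun.aggregate = sum, fill = NA)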
This produces only 2 rows, 10000000007624 and 10000000007784, and everything is aggregated into those two rows.
This error does not occur with the reshape2::dcast function, which produces the correct result.
Is there a reason why dcast.data.table produces this error?
The question was raised on GitHub and answered by @jangorecki. The answer comes from the setNumericRounding help documentation:
when joining or grouping, data.table rounds such data to apx 11 s.f. which is plenty of digits for many cases. This is achieved by rounding the last 2 bytes off the significand.
So my large 14-digit numbers were rounded and, as a result, merged.
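To make the effect concrete: a double stores 52 significand bits, so dropping the last 2 bytes (16 bits) leaves 36. For values near 1e13, which lie between 2^43 and 2^44, neighbouring representable values are then 2^7 = 128 apart. The sketch below only approximates that rounding with ordinary arithmetic. It is not data.table's internal code, and the names `spacing` and `rounded` are made up for illustration.

```r
# Four of the keys from the question, including the smallest and largest
x <- c(10000000007617, 10000000007624, 10000000007745, 10000000007784)

# Gap between representable values once 16 of the 52 significand bits are dropped
spacing <- 2^(floor(log2(x)) - 52 + 16)

# Approximate the rounding by snapping each key to the nearest multiple of that gap
rounded <- round(x / spacing) * spacing

print(spacing)                            # gap for each key
print(format(rounded, scientific = FALSE))
print(length(unique(rounded)))            # number of distinct keys left after rounding
```

Since all keys in the question span fewer than 256 units, they can land on at most a couple of rounded values, which matches the two rows seen in the output.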
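As @jangorecki mentioned, this can be avoided by calling setNumericRounding(0). Personally, however, I have recast my large numbers as factors, which makes more sense for my particular use case.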
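Both workarounds can be sketched roughly as follows, using a small made-up table rather than the full data from the question. The column name `A_fct` is an assumption for illustration, and `sprintf("%.0f", ...)` is just one way to get a full-precision label before converting to a factor. Recent data.table releases may already default to a rounding level of 0, in which case the first option changes nothing.

```r
library(data.table)

# Hypothetical toy data: two distinct 14-digit keys that differ by less than 128
dt <- data.table(
  A = c(10000000007624, 10000000007619, 10000000007624),
  B = c(1L, 2L, 2L),
  C = c(1L, 1L, 3L)
)

# Option 1: switch off rounding of numeric keys, then restore the old setting
old <- getNumericRounding()
setNumericRounding(0)
wide_num <- dcast(dt, A ~ B, value.var = "C", fun.aggregate = sum)
setNumericRounding(old)

# Option 2: treat the identifier as a label instead of a number
dt[, A_fct := factor(sprintf("%.0f", A))]
wide_fct <- dcast(dt, A_fct ~ B, value.var = "C", fun.aggregate = sum)

print(wide_num)
print(wide_fct)
```

Converting to a factor or character column fits well when the large numbers are identifiers that are never used in arithmetic.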
In addition, @jangorecki suggested using the bit64 package when working with large numbers.
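A minimal sketch of the bit64 route, assuming the bit64 package is installed and that the keys are whole numbers. The toy table is made up for illustration. Building the keys from character strings avoids passing them through a double first.

```r
library(data.table)
library(bit64)

# Hypothetical toy data with the keys stored as 64-bit integers
dt <- data.table(
  A = as.integer64(c("10000000007624", "10000000007619", "10000000007624")),
  B = c(1L, 2L, 2L),
  C = c(1L, 1L, 3L)
)

# integer64 keys are compared exactly, so nearby values stay separate
wide <- dcast(dt, A ~ B, value.var = "C", fun.aggregate = sum)
print(wide)
```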
Original post on GitHub.