How to subset a random value of a variable by group with data.table?
I want to get the minimum and one random value of a variable by group in a data.table:
data.table(ggplot2::movies)[, list(min=min(rating), random=sample(rating, 1)), by=list(year, Action)]
This does not work:
Error in `[.data.table`(data.table(movies), , list(min(rating), sample(rating, :
Column 2 of result for group 88 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.
If I coerce the result to numeric, I get this surprising output: groups where the random rating is lower (?!!) than the minimum of the same group.
data.table(ggplot2::movies)[, list(min=min(rating), random=as.numeric(sample(rating, 1))), by=list(year, Action)][random<min]
year Action min random
1: 1916 1 6.2 6
2: 1911 1 5.7 1
3: 1901 1 4.2 3
4: 1914 1 6.1 6
5: 1923 1 8.2 4
6: 1918 1 5.9 5
7: 1921 1 7.5 4
Using .SD doesn't change anything:
data.table(ggplot2::movies)[, list(min=min(rating), random=as.numeric(sample(.SD$rating, 1))), by=list(year, Action)][random<min]
year Action min random
1: 1916 1 6.2 2
2: 1911 1 5.7 4
3: 1893 0 7.0 2
4: 1901 1 4.2 4
5: 1914 1 6.1 5
6: 1923 1 8.2 8
7: 1918 1 5.9 4
Worse, when the variable is an integer, no error is raised at all:
data.table(ggplot2::movies)[, list(min=min(votes), random=sample(votes, 1)), by=list(year, Action)][random<min]
year Action min random
1: 1916 1 135 43
2: 1911 1 26 2
3: 1893 0 90 52
4: 1901 1 13 12
5: 1923 1 757 368
6: 1918 1 60 49
7: 1921 1 73 48
Apparently the sample function doesn't want to work on subsets...
Help!
I finally found a workaround, but it doesn't explain why sample() fails to work as expected on subsets.
data.table(movies)[, list(min=min(votes), random=votes[sample(1:.N, 1)]), by=list(year, Action)]
year Action min random
1: 1971 0 5 77
2: 1939 0 5 13
3: 1941 0 5 7
4: 1996 0 5 4066
5: 1975 0 5 6
---
201: 1931 1 8 8
202: 1928 1 17 41
203: 1923 1 757 757
204: 1918 1 60 60
205: 1921 1 73 73
The odd behavior described earlier is gone:
data.table(movies)[, list(min=min(votes), random=votes[sample(1:.N, 1)]), by=list(year, Action)][random<min]
Empty data.table (0 rows) of 4 cols: year,Action,min,random
You've fallen into the standard sample trap. From ?sample:
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x. Note that this convenience
feature may lead to undesired behaviour when x is of varying length in
calls such as sample(x).
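A minimal base R sketch of that convenience feature, assuming a group that holds a single value. The numbers 5.7 and 757 are borrowed from the outputs in the question, and the seed and variable names are arbitrary choices for illustration:

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable

# A group with a single rating: sample() draws from 1:5, not from c(5.7)
single_rating <- 5.7
draws <- replicate(1000, sample(single_rating, 1))
range(draws)        # stays within 1..5, so always below the group minimum
is.integer(draws)   # TRUE, hence the 'integer' vs 'double' type error

# A group with a single integer vote count: draws come from 1:757
single_votes <- 757L
vote_draws <- replicate(1000, sample(single_votes, 1))
all(vote_draws <= single_votes)   # TRUE, and nearly always strictly smaller

# With two or more values, sample() picks from the values themselves
several <- c(5.7, 6.2, 8.2)
all(replicate(1000, sample(several, 1)) %in% several)
```

This would account for both symptoms: the double column sometimes comes back as integer, and the integer column silently returns values below the minimum whenever a year/Action group has only one row.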
Use e.g. resample, as suggested in the Examples section of ?sample.
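For reference, the resample helper from the Examples of ?sample is a one-liner. The sketch below applies it to a toy table; the data, the seed, and the column name grp are made up for illustration, and the data.table part only runs if the package is installed:

```r
# Helper as given in the Examples of ?sample: sample a position, then index
resample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(42)  # arbitrary seed

# Length-one input now returns the value itself, with its type preserved
resample(5.7, 1)    # 5.7
resample(757L, 1)   # 757

# Toy data (hypothetical): group "a" has one row, group "b" has three
toy <- data.frame(grp = c("a", "b", "b", "b"),
                  rating = c(5.7, 6.2, 8.2, 7.5))

# Base R check: the random pick is never below the group minimum
picks <- tapply(toy$rating, toy$grp, resample, size = 1)
mins  <- tapply(toy$rating, toy$grp, min)
all(picks >= mins)

# Same idea inside data.table's j, guarded in case the package is missing
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- as.data.table(toy)
  res <- DT[, list(min = min(rating), random = resample(rating, 1)), by = grp]
  print(res[random < min])   # expected to be empty
}
```

The workaround in the question, votes[sample(1:.N, 1)], is the same idea written inline: it samples a row position and then indexes. When .N is 1, sample(1, 1) can only return 1, so the single value is picked correctly.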