table() 有有效的替代方法吗?
Is there an efficient alternative to table()?
我使用以下命令:
table(factor("list",levels=1:"n")
使用“列表”:(示例)a = c(1,3,4,4,3)
和 levels = 1:5
,还要考虑 2 和 5。
对于非常大的数据集,我的代码似乎非常低效。
有谁知道隐藏的库或代码片段可以使它更快?
这不是你要找的东西,但也许你可以使用这个:
library(dplyr)
set.seed(8192)
df <- data.frame(X1 = sample(1:10, 100, replace = TRUE))
df %>%
count(X1)
returns
X1 n
1 1 9
2 2 6
3 3 15
4 4 13
5 5 11
6 6 9
7 7 7
8 8 9
9 9 11
10 10 10
如果您需要计算更多数字(包括遗漏的数字),您可以使用
library(tidyr)
library(dplyr)
df2 <- data.frame(X1 = 1:12)
df %>%
count(X1) %>%
right_join(df2, by="X1") %>%
mutate(n = replace_na(n, 0L))
得到
X1 n
1 1 9
2 2 6
3 3 15
4 4 13
5 5 11
6 6 9
7 7 7
8 8 9
9 9 11
10 10 10
11 11 0
12 12 0
我们可以使用 collapse
中的 fnobs
,这会很有效
library(collapse)
fnobs(df, g = df$X1)
在base R
中,tabulate
比table
更有效率
tabulate(df$X1)
[1] 9 6 15 13 11 9 7 9 11 10
我们也可以使用 janitor::tabyl
:
library(janitor)
df %>%
tabyl(X1) %>%
adorn_totals()
X1 n percent
1 9 0.09
2 6 0.06
3 15 0.15
4 13 0.13
5 11 0.11
6 9 0.09
7 7 0.07
8 9 0.09
9 11 0.11
10 10 0.10
Total 100 1.00
使用 aggregate
的基本 R 选项(从 借用 df
)
> aggregate(. ~ X1, cbind(df, n = 1), sum)
X1 n
1 1 9
2 2 6
3 3 15
4 4 13
5 5 11
6 6 9
7 7 7
8 8 9
9 9 11
10 10 10
另一种选择是使用 hist
> hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]
$breaks
[1] 1 2 3 4 5 6 7 8 9 10
$counts
[1] 9 6 15 13 11 9 7 9 21
TL;DR 获胜者是 base::tabulate
。
总而言之,基础 objective 是一种表现,所以我准备了 microbenchmark
所有提供的解决方案。我使用小向量和大向量,两种不同的场景。对于我机器上的 collapse
包,我必须下载最新的 Rcpp
包 1.0.7(以抑制崩溃)。即使是我添加的 也比 base::tabulate
.
慢
suppressMessages(library(janitor))
suppressMessages(library(collapse))
suppressMessages(library(dplyr))
suppressMessages(library(cpp11))
# source
Rcpp::cppFunction('IntegerVector tabulate_rcpp(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (auto& now : x) {
if (now > 0 && now <= max)
counts[now - 1]++;
}
return counts;
}')
set.seed(1234)
a = c(1,3,4,4,3)
levels = 1:5
df <- data.frame(X1 = a)
microbenchmark::microbenchmark(tabulate_rcpp = {tabulate_rcpp(df$X1, max(df$X1))},
base_table = {base::table(factor(df$X1, 1:max(df$X1)))},
stats_aggregate = {stats::aggregate(. ~ X1, cbind(df, n = 1), sum)},
graphics_hist = {hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]},
janitor_tably = {adorn_totals(tabyl(df, X1))},
collapse_fnobs = {fnobs(df, df$X1)},
base_tabulate = {tabulate(df$X1)},
dplyr_count = {count(df, X1)})
#> Unit: microseconds
#> expr min lq mean median uq max
#> tabulate_rcpp 2.959 5.9800 17.42326 7.9465 9.5435 883.561
#> base_table 48.524 59.5490 72.42985 66.3135 78.9320 153.216
#> stats_aggregate 829.324 891.7340 1069.86510 937.4070 1140.0345 2883.025
#> graphics_hist 148.561 170.5305 221.05290 188.9570 228.3160 958.619
#> janitor_tably 6005.490 6439.6870 8137.82606 7497.1985 8283.3670 53352.680
#> collapse_fnobs 14.591 21.9790 32.63891 27.2530 32.6465 417.987
#> base_tabulate 1.879 4.3310 5.68916 5.5990 6.6210 16.789
#> dplyr_count 1832.648 1969.8005 2546.17131 2350.0450 2560.3585 7210.992
#> neval
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
df <- data.frame(X1 = sample(1:5, 1000, replace = TRUE))
microbenchmark::microbenchmark(tabulate_rcpp = {tabulate_rcpp(df$X1, max(df$X1))},
base_table = {base::table(factor(df$X1, 1:max(df$X1)))},
stats_aggregate = {stats::aggregate(. ~ X1, cbind(df, n = 1), sum)},
graphics_hist = {hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]},
janitor_tably = {adorn_totals(tabyl(df, X1))},
collapse_fnobs = {fnobs(df, df$X1)},
base_tabulate = {tabulate(df$X1)},
dplyr_count = {count(df, X1)})
#> Unit: microseconds
#> expr min lq mean median uq max
#> tabulate_rcpp 4.847 8.8465 10.92661 10.3105 12.6785 28.407
#> base_table 83.736 107.2040 121.77962 118.8450 129.9560 184.427
#> stats_aggregate 1027.918 1155.9205 1338.27752 1246.6205 1434.8990 2085.821
#> graphics_hist 209.273 237.8265 274.60654 258.9260 300.3830 523.803
#> janitor_tably 5988.085 6497.9675 7833.34321 7593.3445 8422.6950 13759.142
#> collapse_fnobs 26.085 38.6440 51.89459 47.8250 57.3440 333.034
#> base_tabulate 4.501 6.7360 8.09408 8.2330 9.2170 11.463
#> dplyr_count 1852.290 2000.5225 2374.28205 2145.9835 2516.7940 4834.544
#> neval
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
由 reprex package (v2.0.0)
于 2021-08-01 创建
这里还有一个:summarytools
Martin Gal 的数据!非常感谢:
library(summarytools)
set.seed(8192)
df <- data.frame(X1 = sample(1:10, 100, replace = TRUE))
summarytools::freq(df$X1, cumul=FALSE)
输出:
Freq % Valid % Total
----------- ------ --------- ---------
1 9 9.00 9.00
2 6 6.00 6.00
3 15 15.00 15.00
4 13 13.00 13.00
5 11 11.00 11.00
6 9 9.00 9.00
7 7 7.00 7.00
8 9 9.00 9.00
9 11 11.00 11.00
10 10 10.00 10.00
<NA> 0 0.00
Total 100 100.00 100.00
我使用以下命令:
table(factor("list",levels=1:"n")
使用“列表”:(示例)a = c(1,3,4,4,3)
和 levels = 1:5
,还要考虑 2 和 5。
对于非常大的数据集,我的代码似乎非常低效。
有谁知道隐藏的库或代码片段可以使它更快?
这不是你要找的东西,但也许你可以使用这个:
library(dplyr)
set.seed(8192)
df <- data.frame(X1 = sample(1:10, 100, replace = TRUE))
df %>%
count(X1)
returns
X1 n
1 1 9
2 2 6
3 3 15
4 4 13
5 5 11
6 6 9
7 7 7
8 8 9
9 9 11
10 10 10
如果您需要计算更多数字(包括遗漏的数字),您可以使用
library(tidyr)
library(dplyr)
df2 <- data.frame(X1 = 1:12)
df %>%
count(X1) %>%
right_join(df2, by="X1") %>%
mutate(n = replace_na(n, 0L))
得到
X1 n
1 1 9
2 2 6
3 3 15
4 4 13
5 5 11
6 6 9
7 7 7
8 8 9
9 9 11
10 10 10
11 11 0
12 12 0
我们可以使用 collapse
中的 fnobs
,这会很有效
library(collapse)
fnobs(df, g = df$X1)
在base R
中,tabulate
比table
tabulate(df$X1)
[1] 9 6 15 13 11 9 7 9 11 10
我们也可以使用 janitor::tabyl
:
library(janitor)
df %>%
tabyl(X1) %>%
adorn_totals()
X1 n percent
1 9 0.09
2 6 0.06
3 15 0.15
4 13 0.13
5 11 0.11
6 9 0.09
7 7 0.07
8 9 0.09
9 11 0.11
10 10 0.10
Total 100 1.00
使用 aggregate
的基本 R 选项(从 df
)
> aggregate(. ~ X1, cbind(df, n = 1), sum)
X1 n
1 1 9
2 2 6
3 3 15
4 4 13
5 5 11
6 6 9
7 7 7
8 8 9
9 9 11
10 10 10
另一种选择是使用 hist
> hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]
$breaks
[1] 1 2 3 4 5 6 7 8 9 10
$counts
[1] 9 6 15 13 11 9 7 9 21
TL;DR 获胜者是 base::tabulate
。
总而言之,基础 objective 是一种表现,所以我准备了 microbenchmark
所有提供的解决方案。我使用小向量和大向量,两种不同的场景。对于我机器上的 collapse
包,我必须下载最新的 Rcpp
包 1.0.7(以抑制崩溃)。即使是我添加的 base::tabulate
.
suppressMessages(library(janitor))
suppressMessages(library(collapse))
suppressMessages(library(dplyr))
suppressMessages(library(cpp11))
# source
Rcpp::cppFunction('IntegerVector tabulate_rcpp(const IntegerVector& x, const unsigned max) {
IntegerVector counts(max);
for (auto& now : x) {
if (now > 0 && now <= max)
counts[now - 1]++;
}
return counts;
}')
set.seed(1234)
a = c(1,3,4,4,3)
levels = 1:5
df <- data.frame(X1 = a)
microbenchmark::microbenchmark(tabulate_rcpp = {tabulate_rcpp(df$X1, max(df$X1))},
base_table = {base::table(factor(df$X1, 1:max(df$X1)))},
stats_aggregate = {stats::aggregate(. ~ X1, cbind(df, n = 1), sum)},
graphics_hist = {hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]},
janitor_tably = {adorn_totals(tabyl(df, X1))},
collapse_fnobs = {fnobs(df, df$X1)},
base_tabulate = {tabulate(df$X1)},
dplyr_count = {count(df, X1)})
#> Unit: microseconds
#> expr min lq mean median uq max
#> tabulate_rcpp 2.959 5.9800 17.42326 7.9465 9.5435 883.561
#> base_table 48.524 59.5490 72.42985 66.3135 78.9320 153.216
#> stats_aggregate 829.324 891.7340 1069.86510 937.4070 1140.0345 2883.025
#> graphics_hist 148.561 170.5305 221.05290 188.9570 228.3160 958.619
#> janitor_tably 6005.490 6439.6870 8137.82606 7497.1985 8283.3670 53352.680
#> collapse_fnobs 14.591 21.9790 32.63891 27.2530 32.6465 417.987
#> base_tabulate 1.879 4.3310 5.68916 5.5990 6.6210 16.789
#> dplyr_count 1832.648 1969.8005 2546.17131 2350.0450 2560.3585 7210.992
#> neval
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
df <- data.frame(X1 = sample(1:5, 1000, replace = TRUE))
microbenchmark::microbenchmark(tabulate_rcpp = {tabulate_rcpp(df$X1, max(df$X1))},
base_table = {base::table(factor(df$X1, 1:max(df$X1)))},
stats_aggregate = {stats::aggregate(. ~ X1, cbind(df, n = 1), sum)},
graphics_hist = {hist(df$X1, plot = FALSE, right = FALSE)[c("breaks", "counts")]},
janitor_tably = {adorn_totals(tabyl(df, X1))},
collapse_fnobs = {fnobs(df, df$X1)},
base_tabulate = {tabulate(df$X1)},
dplyr_count = {count(df, X1)})
#> Unit: microseconds
#> expr min lq mean median uq max
#> tabulate_rcpp 4.847 8.8465 10.92661 10.3105 12.6785 28.407
#> base_table 83.736 107.2040 121.77962 118.8450 129.9560 184.427
#> stats_aggregate 1027.918 1155.9205 1338.27752 1246.6205 1434.8990 2085.821
#> graphics_hist 209.273 237.8265 274.60654 258.9260 300.3830 523.803
#> janitor_tably 5988.085 6497.9675 7833.34321 7593.3445 8422.6950 13759.142
#> collapse_fnobs 26.085 38.6440 51.89459 47.8250 57.3440 333.034
#> base_tabulate 4.501 6.7360 8.09408 8.2330 9.2170 11.463
#> dplyr_count 1852.290 2000.5225 2374.28205 2145.9835 2516.7940 4834.544
#> neval
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
由 reprex package (v2.0.0)
于 2021-08-01 创建这里还有一个:summarytools
Martin Gal 的数据!非常感谢:
library(summarytools)
set.seed(8192)
df <- data.frame(X1 = sample(1:10, 100, replace = TRUE))
summarytools::freq(df$X1, cumul=FALSE)
输出:
Freq % Valid % Total
----------- ------ --------- ---------
1 9 9.00 9.00
2 6 6.00 6.00
3 15 15.00 15.00
4 13 13.00 13.00
5 11 11.00 11.00
6 9 9.00 9.00
7 7 7.00 7.00
8 9 9.00 9.00
9 11 11.00 11.00
10 10 10.00 10.00
<NA> 0 0.00
Total 100 100.00 100.00