How to reduce the size of a preprocessing recipe object in R?
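I am preprocessing a dataset with the R recipes package: a Yeo-Johnson transformation to make the data more normally distributed, followed by scaling to normalize it. Afterwards I want to reduce the size of the recipe object, so I used the butcher package, but that did not help. I also tried manually clearing the 'template' that stores the data, but the size again stayed the same. Any idea how to reduce the size for storage and later use? Here is an example of the real-world problem I am facing: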
suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(recipes)
})
# Let's generate skewed numeric data of size 20,000 x 3,000 (originally I am working with 10x more rows)
n <- 3000
example_list <-
  1:n %>%
  map(~ abs(rnorm(n = 20000, mean = 0, sd = sample(seq(0.1, 10, length.out = n), size = n))))
names(example_list) <- paste0("col_", 1:n)
example_tibble <- as_tibble(example_list)
# Let's create the preprocessing recipe
new_recipe <-
  recipe(~ ., data = example_tibble) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(strings_as_factors = FALSE, retain = FALSE)
# Let's check the structure and size of the recipe object
butcher::weigh(new_recipe)
#> # A tibble: 9,034 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.terms 480.
#> 3 steps.lambdas 0.232
#> 4 steps.means 0.232
#> 5 steps.sds 0.232
#> 6 var_info.variable 0.208
#> 7 term_info.variable 0.208
#> 8 last_term_info.variable 0.208
#> 9 template.col_1 0.160
#> 10 template.col_2 0.160
#> # … with 9,024 more rows
lobstr::obj_size(new_recipe)
#> 481,649,536 B
# Let's try to remove unnecessary parts of the object
new_recipe_butchered <- butcher::butcher(new_recipe, verbose = TRUE)
#> ✖ No memory released. Do not butcher.
# Let's check the size again
lobstr::obj_size(new_recipe_butchered)
#> 481,650,016 B
butcher::weigh(new_recipe_butchered)
#> # A tibble: 9,034 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.lambdas 0.232
#> 3 steps.means 0.232
#> 4 steps.sds 0.232
#> 5 var_info.variable 0.208
#> 6 term_info.variable 0.208
#> 7 last_term_info.variable 0.208
#> 8 template.col_1 0.160
#> 9 template.col_2 0.160
#> 10 template.col_3 0.160
#> # … with 9,024 more rows
# Let's try to remove the template holding the data
new_recipe_butchered$template <- NULL
butcher::weigh(new_recipe_butchered)
#> # A tibble: 6,034 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.lambdas 0.232
#> 3 steps.means 0.232
#> 4 steps.sds 0.232
#> 5 var_info.variable 0.208
#> 6 term_info.variable 0.208
#> 7 last_term_info.variable 0.208
#> 8 var_info.role 0.0241
#> 9 var_info.source 0.0241
#> 10 term_info.role 0.0241
#> # … with 6,024 more rows
# Let's check the size again: still the same
lobstr::obj_size(new_recipe_butchered)
#> 481,650,016 B
Created on 2021-06-17 by the reprex package (v0.3.0)
I can't seem to reduce the size. Can anyone help?
This has been fixed in the development version of {butcher}, which you can install with:
# install.packages("devtools")
devtools::install_github("tidymodels/butcher")
{butcher} now removes the environments from the terms in steps.
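To illustrate why an environment attached to `terms` can dominate the object size, here is a minimal base R sketch. It is not the recipes or butcher internals: the helper `make_formula()` and the variable `big` are hypothetical, and a plain formula stands in for whatever a step stores in `terms`. It measures size with `serialize()`, the same serialization that `saveRDS()` writes to disk.

```r
# Hypothetical helper: returns a tiny formula created inside a function.
# The formula keeps a reference to the function's environment, which
# also holds the large vector `big`.
make_formula <- function() {
  big <- rnorm(1e6)  # about 8 MB of doubles
  ~ x
}

f <- make_formula()

# Bytes needed to serialize the formula together with its environment
size_with_env <- length(serialize(f, NULL))

# Point the formula at the empty environment, dropping the reference to `big`
environment(f) <- emptyenv()
size_without_env <- length(serialize(f, NULL))

print(c(with_env = size_with_env, without_env = size_without_env))
```

The formula itself is only a few symbols; nearly all of the serialized bytes come from the captured environment. That would also be consistent with the question's observation that setting `template` to `NULL` changed nothing, since the bulk was reported under `steps.terms`.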
suppressPackageStartupMessages({
  library(dplyr)
  library(purrr)
  library(recipes)
})
n <- 3000
example_list <-
  1:n %>%
  map(~ abs(rnorm(n = 20000, mean = 0, sd = sample(seq(0.1, 10, length.out = n), size = n))))
names(example_list) <- paste0("col_", 1:n)
example_tibble <- as_tibble(example_list)
new_recipe <-
  recipe(~ ., data = example_tibble) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(strings_as_factors = FALSE, retain = FALSE)
butcher::weigh(new_recipe)
#> # A tibble: 12,033 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.terms 480.
#> 3 steps.lambdas 0.232
#> 4 steps.means 0.232
#> 5 steps.sds 0.232
#> 6 var_info.variable 0.208
#> 7 term_info.variable 0.208
#> 8 last_term_info.variable 0.208
#> 9 var_info.role 0.0241
#> 10 var_info.source 0.0241
#> # … with 12,023 more rows
lobstr::obj_size(new_recipe)
#> 481,985,880 B
new_recipe_butchered <- butcher::butcher(new_recipe, verbose = TRUE)
#> ✓ Memory released: '480,170,888 B'
lobstr::obj_size(new_recipe_butchered)
#> 1,814,992 B
butcher::weigh(new_recipe_butchered)
#> # A tibble: 12,033 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.lambdas 0.232
#> 2 steps.means 0.232
#> 3 steps.sds 0.232
#> 4 var_info.variable 0.208
#> 5 term_info.variable 0.208
#> 6 last_term_info.variable 0.208
#> 7 var_info.role 0.0241
#> 8 var_info.source 0.0241
#> 9 term_info.role 0.0241
#> 10 term_info.source 0.0241
#> # … with 12,023 more rows
Created on 2021-06-17 by the reprex package (v2.0.0)