How to reduce build time for a Docker container installing R libraries?
I need to run some code in a Docker container that includes both Python 3.8 and R 4.1.0. Below is my Dockerfile.
FROM python:3.8-slim-buster AS final-image
# R version to install
ARG R_BASE_VERSION=4.1.0
ARG PREBUILD_DEPS="software-properties-common gnupg2"
ARG BUILD_DEPS="build-essential binutils cmake gfortran libblas-dev liblapack-dev libjpeg-dev libpng-dev libnlopt-dev pkg-config"
ARG RUNTIME_DEPS="r-base=${R_BASE_VERSION}-* libcurl4-openssl-dev libssl-dev libxml2-dev"
# venv path
ENV PATH="/opt/venv/bin:$PATH"
RUN apt-get update \
# Adding this to install latest versions of g++
&& echo 'deb http://deb.debian.org/debian testing main' > /etc/apt/sources.list.d/testing.list \
# Install the below packages to add repo which is then used to install R version 4
&& apt-get install -y --no-install-recommends $PREBUILD_DEPS \
&& add-apt-repository 'deb http://cloud.r-project.org/bin/linux/debian buster-cran40/' \
# This key is required to install r-base version 4
&& apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-key FCAE2A0E115C3D8A \
# Update again to use the newly added sources
&& apt-get update \
&& apt-get install -y --no-install-recommends $RUNTIME_DEPS $BUILD_DEPS \
&& python -m venv /opt/venv \
&& /opt/venv/bin/python -m pip install --upgrade pip \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt packages.R /
RUN pip install wheel setuptools \
&& pip install --no-cache-dir -r requirements.txt \
&& pip install --no-cache-dir --no-binary xgboost xgboost \
&& Rscript packages.R \
&& strip --strip-unneeded /usr/local/lib/R/site-library/Boom/lib/libboom.a \
&& strip --strip-debug /usr/local/lib/R/site-library/*/libs/*so \
# Uninstall unnecessary dependencies
&& rm -rf /tmp/* \
&& apt-get purge -y --auto-remove $BUILD_DEPS $PREBUILD_DEPS \
&& apt-get autoremove -y \
&& apt-get autoclean -y \
&& rm -rf /var/lib/apt/lists/*
ENTRYPOINT XXX
Here is the packages.R file:
#Setting environment
rm(list = ls())
cat("4")
print(Sys.time())
# CRAN mirror to use. cran.rstudio.com is a CDN and the recommended mirror.
# Specifying multiple backup CRAN mirrors as Jenkins builds fail
# intermittently due to unavailability of packages in the main mirror.
cran_repos = c(MAIN_CRAN_MIRROR = 'https://cran.rstudio.com',
ALT_CRAN_MIRROR = 'http://cran.r-project.org/')
#Loading Libraries
package_ls <- c(
"config",
"crayon",
"aws.s3",
"aws.ec2metadata",
"dplyr",
"data.table",
"imputeTS",
"Metrics",
"StatMeasures",
"tseries",
"purrr",
"log4r",
"lubridate",
"forecast",
"caret",
"MASS",
"stringr",
"tidyr",
"uroot",
"readr",
"Boruta",
"bsts"
)
for (pkg_name in package_ls) {
message("Installing ", pkg_name)
install.packages(pkg_name, repos = cran_repos)
if (!(pkg_name %in% installed.packages()[, 'Package'])) {
stop(pkg_name,
" is a required package and it could not be installed, stopping!")
}
}
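Question
Building the Docker container takes much longer than I would like. This is because some packages (e.g. bsts) need their dependencies (e.g. the C++ library Boom) to be built from source, which takes a lot of time.
Is there a way to:
- speed up the build of the R libraries? Or
- build the R libraries locally and copy only the binaries into the Docker container? Or
- reduce the build time of the R packages in any other way?
Thanks in advance.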
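Update
Some ideas from the comments:
From @botje:
- Use the Ncpus parameter of the R function install.packages to install R packages in parallel. (I have 4 CPUs available, and setting Ncpus = 4 gave a 10% speedup.)
install.packages(package_ls, repos = cran_repos, Ncpus = 4)
- Create a custom CRAN mirror containing locally compiled packages to speed up installation.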
I rewrote the last part of your packages.R as follows:
install.packages(package_ls, Ncpus=16)
Compared to running with Ncpus=1, this gave me a speedup of more than 3x (189 seconds versus 719 seconds).
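For illustration, here is a minimal sketch of how the single vectorised call could be combined with the backup mirrors and the "stop if missing" check from the question's packages.R. The helper names choose_ncpus and install_required are hypothetical, and deriving Ncpus from parallel::detectCores() is an assumption, not something either post did. Ncpus parallelises across source packages, so passing the whole package vector in one call (rather than one package per loop iteration) is probably what lets it help.

```r
# Hypothetical helper: pick a worker count from the detected cores.
# parallel::detectCores() can return NA, so fall back to 1.
choose_ncpus <- function() {
  n <- parallel::detectCores()
  if (is.na(n)) 1L else max(1L, as.integer(n))
}

# Hypothetical helper: install all packages in one call so that Ncpus
# can build several source packages at once, then fail the build if
# any required package is still missing.
install_required <- function(pkgs, repos, ncpus = choose_ncpus()) {
  install.packages(pkgs, repos = repos, Ncpus = ncpus)
  missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing_pkgs) > 0) {
    stop("Required packages could not be installed: ",
         paste(missing_pkgs, collapse = ", "))
  }
  invisible(pkgs)
}

# Usage in packages.R, with package_ls and cran_repos as defined there:
# install_required(package_ls, cran_repos)
```

Inside a container the detected core count may differ from what the build is actually allowed to use, so passing an explicit value such as ncpus = 4 may be the safer choice.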