How do I write a Dockerfile for a Ruby Capybara scraper?

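I am trying to write a Dockerfile to run a Ruby Capybara scraper in a container. I tested the following code on the host OS and it works there, but it fails in the Docker container.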

Dockerfile

FROM ruby:2.6.6

RUN apt-get update -y && \
apt-get install -y xvfb

RUN wget https://ftp.mozilla.org/pub/firefox/releases/80.0.1/linux-x86_64/en-US/firefox-80.0.1.tar.bz2
RUN tar -xjf firefox-80.0.1.tar.bz2
RUN mv firefox /opt/firefox80
RUN ln -s /opt/firefox80/firefox /usr/bin/firefox
RUN ls /opt/firefox80

RUN wget -N https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-linux64.tar.gz
RUN tar -xvzf geckodriver-v0.27.0-linux64.tar.gz
RUN chmod +x geckodriver
RUN mv -f geckodriver /usr/local/share/geckodriver
RUN ln -s /usr/local/share/geckodriver /usr/local/bin/geckodriver
RUN ln -s /usr/local/share/geckodriver /usr/bin/geckodriver
RUN mkdir capybara
WORKDIR /capybara/
COPY . /capybara

RUN bundle install

main.rb

require 'capybara'
require 'capybara/dsl'
require 'selenium-webdriver'

include Capybara::DSL

Capybara.register_driver :selenium_headless_firefox do |app|
  browser_options = ::Selenium::WebDriver::Firefox::Options.new()
  browser_options.args << '--headless'

  Capybara::Selenium::Driver.new(
    app,
    browser: :firefox,
    options: browser_options
  )
end

target = "https://maps.google.com/?cid=13666314335012854449"

session = Capybara::Session.new(:selenium_headless_firefox)
session.visit(target)

Gemfile

source 'https://rubygems.org'

gem 'selenium-webdriver'
gem 'capybara', '~>3.30'
gem 'geckodriver-helper'

Error message on Docker
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': invalid argument: can't kill an exited process (Selenium::WebDriver::Error::UnknownError)
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:102:in `create_session'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/firefox/marionette/driver.rb:44:in `initialize'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/firefox/driver.rb:33:in `new'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/firefox/driver.rb:33:in `new'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:54:in `for'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'
        from /usr/local/bundle/gems/capybara-3.33.0/lib/capybara/selenium/driver.rb:52:in `browser'
        from /usr/local/bundle/gems/capybara-3.33.0/lib/capybara/selenium/driver.rb:71:in `visit'
        from /usr/local/bundle/gems/capybara-3.33.0/lib/capybara/session.rb:278:in `visit'

This is what I get when I run the file in the Docker container. I would appreciate any help from the developer community.

I run main.rb with docker run [docker_image] ruby main.rb

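The problem is not with Capybara but with Firefox: the tar.bz2 archive you downloaded does not include Firefox's dependencies, so the browser crashes on startup. The simplest solution is to install it via apt. Assuming all your files are in the same directory, your Dockerfile should look like this: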

FROM ruby:2.6.6

WORKDIR /app

COPY . .

RUN apt-get update -y && \
    apt-get install -y xvfb firefox-esr && \
    wget -N https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.27.0-linux64.tar.gz && \
    chmod +x geckodriver && \
    mv -f geckodriver /usr/local/share/geckodriver && \
    ln -s /usr/local/share/geckodriver /usr/local/bin/geckodriver && \
    ln -s /usr/local/share/geckodriver /usr/bin/geckodriver && \
    bundle install && \
    apt-get clean && \
    rm geckodriver-v0.27.0-linux64.tar.gz && \
    rm -rf /var/lib/apt/lists/*

CMD [ "ruby", "/app/main.rb" ]
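If you want to confirm that the browser itself is the culprit before touching Capybara, a small preflight script can help. The sketch below is not part of the original setup: the helper names `which` and `version_check` are made up for illustration, and it assumes that a Firefox with missing shared libraries exits with a non-zero status on `firefox --version`, which is the usual behaviour but worth verifying in your image.

```ruby
require 'open3'

# Return the full path of +name+ if it is an executable file in one of the
# directories listed in +path+ (defaults to the PATH variable), else nil.
def which(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Run "<binary> --version" and return [ok, message].
# ok is false when the binary is missing or exits with a non-zero status.
def version_check(binary)
  path = which(binary)
  return [false, "#{binary}: not found on PATH"] unless path

  output, status = Open3.capture2e(path, '--version')
  [status.success?, output.strip]
rescue SystemCallError => e
  [false, "#{binary}: #{e.message}"]
end

if __FILE__ == $PROGRAM_NAME
  %w[firefox geckodriver].each do |binary|
    ok, message = version_check(binary)
    puts "#{ok ? 'OK  ' : 'FAIL'} #{message}"
  end
end
```

Running it inside the container, for example with `docker run --rm capybara:latest ruby preflight.rb` (where `preflight.rb` is a hypothetical file name), should show which binary is missing or fails to start.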

Then you can run:

docker build -t capybara:latest . # Build image
docker run -it --rm --env DISPLAY=$DISPLAY --volume="$HOME/.Xauthority:/root/.Xauthority:rw" --net=host capybara:latest firefox # Verify Firefox works
docker run -it --rm capybara:latest # Run your script

Note: the second command only works on Linux. Running dockerized Linux GUI applications on Windows is a bit harder and requires some additional setup.

Edit:

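There is no such thing as installing something "on Docker". Docker is not an OS. It is an application containerization framework. It can run containers with various operating systems inside (or with no OS at all, see base image). This means that the way to install something in a Docker image (or in a container, which is not recommended) depends on what is already installed there.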

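In this case, your base image ruby:2.6.6 is based on the Debian Buster image (see its Dockerfile), so you need to install the browser you want the same way you would on a regular desktop or server running that system.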
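If you are not sure which distribution a base image uses, you can ask the container itself. This is a minimal sketch that reads `/etc/os-release`, which most current Linux distributions ship; the function name `parse_os_release` is hypothetical and only there for illustration.

```ruby
# Parse the KEY=value lines of an os-release file into a Hash,
# skipping blank lines and comments and stripping surrounding quotes.
def parse_os_release(text)
  text.each_line.with_object({}) do |line, info|
    line = line.strip
    next if line.empty? || line.start_with?('#')

    key, value = line.split('=', 2)
    next if value.nil?

    info[key] = value.gsub(/\A["']|["']\z/, '')
  end
end

if __FILE__ == $PROGRAM_NAME && File.readable?('/etc/os-release')
  info = parse_os_release(File.read('/etc/os-release'))
  puts "#{info['ID']} #{info['VERSION_CODENAME']}"
end
```

Inside a container started from ruby:2.6.6 I would expect this to report Debian with the buster codename, which tells you to use apt and Debian package names.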

Debian Buster does not package Chrome, because it is not open source. To install its open-source equivalent, Chromium, modify your Dockerfile as follows:

FROM ruby:2.6.6

WORKDIR /app

COPY . .

RUN apt-get update -y && \
    apt-get install -y xvfb chromium && \
    wget -N https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.27.0-linux64.tar.gz && \
    chmod +x geckodriver && \
    mv -f geckodriver /usr/local/share/geckodriver && \
    ln -s /usr/local/share/geckodriver /usr/local/bin/geckodriver && \
    ln -s /usr/local/share/geckodriver /usr/bin/geckodriver && \
    bundle install && \
    apt-get clean && \
    rm geckodriver-v0.27.0-linux64.tar.gz && \
    rm -rf /var/lib/apt/lists/*

CMD [ "ruby", "/app/main.rb" ] 

If you really need Chrome, follow the official documentation (remember to delete the archive file after installation). That said, a Dockerfile for Chrome would be:

FROM ruby:2.6.6

WORKDIR /app

COPY . .

RUN apt-get update -y && \
    apt-get install -y xvfb && \
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && \
    apt install -y ./google-chrome-stable_current_amd64.deb && \
    wget -N https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.27.0-linux64.tar.gz && \
    chmod +x geckodriver && \
    mv -f geckodriver /usr/local/share/geckodriver && \
    ln -s /usr/local/share/geckodriver /usr/local/bin/geckodriver && \
    ln -s /usr/local/share/geckodriver /usr/bin/geckodriver && \
    bundle install && \
    apt-get clean && \
    rm google-chrome-stable_current_amd64.deb && \
    rm geckodriver-v0.27.0-linux64.tar.gz && \
    rm -rf /var/lib/apt/lists/*

CMD [ "ruby", "/app/main.rb" ]
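Note that geckodriver only drives Firefox. To actually use Chromium or Chrome from Capybara you also need chromedriver in the image (on Debian I would expect the `chromium-driver` package to provide it, but check that against your base image) and a matching driver registration in main.rb. A minimal sketch, assuming the selenium-webdriver 3.x API that appears in the stack trace; the method name `register_headless_chrome`, the driver name `:selenium_headless_chrome` and the `/usr/bin/chromium` path are my assumptions, not something taken from the question:

```ruby
# Flags commonly needed to run Chrome/Chromium inside a container.
HEADLESS_CHROME_ARGS = [
  '--headless',              # no X server required
  '--no-sandbox',            # Chrome refuses to start as root with the sandbox on
  '--disable-dev-shm-usage'  # Docker's default /dev/shm is small
].freeze

# Register a headless Chrome driver with Capybara and return its name.
# The gems are required lazily so this file can be loaded without them.
# +binary+ is an optional path to the browser (assumed, e.g. Chromium).
def register_headless_chrome(name = :selenium_headless_chrome, binary: nil)
  require 'capybara'
  require 'selenium-webdriver'

  Capybara.register_driver(name) do |app|
    options = Selenium::WebDriver::Chrome::Options.new
    HEADLESS_CHROME_ARGS.each { |arg| options.add_argument(arg) }
    options.binary = binary if binary

    Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
  end
  name
end

# Usage inside the container (omit binary: for google-chrome-stable):
#   register_headless_chrome(binary: '/usr/bin/chromium')
#   session = Capybara::Session.new(:selenium_headless_chrome)
#   session.visit('https://example.com')
```

Keeping the flags in one constant makes it easy to see what differs from a desktop run; on the host you can usually drop `--no-sandbox`.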