Install Scrapy on Windows Server 2019, running in a Docker container

I want to install Scrapy on Windows Server 2019, running in a Docker container (see … and … for my installation history).

On my local Windows 10 machine I can run my Scrapy command in Windows PowerShell (after simply starting Docker Desktop): scrapy crawl myscraper -o allobjects.json in folder C:\scrapy\my1stscraper\

On the Windows server I first installed Anaconda, as recommended here, following these steps: https://docs.scrapy.org/en/latest/intro/install.html.

I then opened the Anaconda prompt and, in D:\Programs, typed conda install -c conda-forge scrapy
(base) PS D:\Programs> dir


    Directory: D:\Programs


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        4/22/2021  10:52 AM                Anaconda3
-a----        4/22/2021  11:20 AM              0 conda


(base) PS D:\Programs> conda install -c conda-forge scrapy
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.9.2
  latest version: 4.10.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: D:\Programs\Anaconda3

  added / updated specs:
    - scrapy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    automat-20.2.0             |             py_0          30 KB  conda-forge
    conda-4.10.1               |   py38haa244fe_0         3.1 MB  conda-forge
    constantly-15.1.0          |             py_0           9 KB  conda-forge
    cssselect-1.1.0            |             py_0          18 KB  conda-forge
    hyperlink-21.0.0           |     pyhd3deb0d_0          71 KB  conda-forge
    incremental-17.5.0         |             py_0          14 KB  conda-forge
    itemadapter-0.2.0          |     pyhd8ed1ab_0          12 KB  conda-forge
    parsel-1.6.0               |             py_0          15 KB  conda-forge
    pyasn1-0.4.8               |             py_0          53 KB  conda-forge
    pyasn1-modules-0.2.7       |             py_0          60 KB  conda-forge
    pydispatcher-2.0.5         |             py_1          12 KB  conda-forge
    pyhamcrest-2.0.2           |             py_0          29 KB  conda-forge
    python_abi-3.8             |           1_cp38           4 KB  conda-forge
    queuelib-1.6.1             |     pyhd8ed1ab_0          14 KB  conda-forge
    scrapy-2.4.1               |   py38haa95532_0         372 KB
    service_identity-18.1.0    |             py_0          12 KB  conda-forge
    twisted-21.2.0             |   py38h294d835_0         5.1 MB  conda-forge
    twisted-iocpsupport-1.0.1  |   py38h294d835_0          49 KB  conda-forge
    w3lib-1.22.0               |     pyh9f0ad1d_0          21 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         9.0 MB

The following NEW packages will be INSTALLED:

  automat            conda-forge/noarch::automat-20.2.0-py_0
  constantly         conda-forge/noarch::constantly-15.1.0-py_0
  cssselect          conda-forge/noarch::cssselect-1.1.0-py_0
  hyperlink          conda-forge/noarch::hyperlink-21.0.0-pyhd3deb0d_0
  incremental        conda-forge/noarch::incremental-17.5.0-py_0
  itemadapter        conda-forge/noarch::itemadapter-0.2.0-pyhd8ed1ab_0
  parsel             conda-forge/noarch::parsel-1.6.0-py_0
  pyasn1             conda-forge/noarch::pyasn1-0.4.8-py_0
  pyasn1-modules     conda-forge/noarch::pyasn1-modules-0.2.7-py_0
  pydispatcher       conda-forge/noarch::pydispatcher-2.0.5-py_1
  pyhamcrest         conda-forge/noarch::pyhamcrest-2.0.2-py_0
  python_abi         conda-forge/win-64::python_abi-3.8-1_cp38
  queuelib           conda-forge/noarch::queuelib-1.6.1-pyhd8ed1ab_0
  scrapy             pkgs/main/win-64::scrapy-2.4.1-py38haa95532_0
  service_identity   conda-forge/noarch::service_identity-18.1.0-py_0
  twisted            conda-forge/win-64::twisted-21.2.0-py38h294d835_0
  twisted-iocpsuppo~ conda-forge/win-64::twisted-iocpsupport-1.0.1-py38h294d835_0
  w3lib              conda-forge/noarch::w3lib-1.22.0-pyh9f0ad1d_0

The following packages will be UPDATED:

  conda               pkgs/main::conda-4.9.2-py38haa95532_0 --> conda-forge::conda-4.10.1-py38haa244fe_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
constantly-15.1.0    | 9 KB      | ############################################################################ | 100%
itemadapter-0.2.0    | 12 KB     | ############################################################################ | 100%
twisted-21.2.0       | 5.1 MB    | ############################################################################ | 100%
pydispatcher-2.0.5   | 12 KB     | ############################################################################ | 100%
queuelib-1.6.1       | 14 KB     | ############################################################################ | 100%
service_identity-18. | 12 KB     | ############################################################################ | 100%
pyhamcrest-2.0.2     | 29 KB     | ############################################################################ | 100%
cssselect-1.1.0      | 18 KB     | ############################################################################ | 100%
automat-20.2.0       | 30 KB     | ############################################################################ | 100%
pyasn1-0.4.8         | 53 KB     | ############################################################################ | 100%
twisted-iocpsupport- | 49 KB     | ############################################################################ | 100%
python_abi-3.8       | 4 KB      | ############################################################################ | 100%
hyperlink-21.0.0     | 71 KB     | ############################################################################ | 100%
conda-4.10.1         | 3.1 MB    | ############################################################################ | 100%
scrapy-2.4.1         | 372 KB    | ############################################################################ | 100%
incremental-17.5.0   | 14 KB     | ############################################################################ | 100%
w3lib-1.22.0         | 21 KB     | ############################################################################ | 100%
pyasn1-modules-0.2.7 | 60 KB     | ############################################################################ | 100%
parsel-1.6.0         | 15 KB     | ############################################################################ | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(base) PS D:\Programs>

In PowerShell on my VPS I then tried to run Scrapy via D:\Programs\Anaconda3\Scripts\scrapy.exe

I want to run the spider I have stored in folder D:\scrapy\my1stscraper, see:

The Docker Engine service is running as a Windows service (so I assume I do not need to explicitly start a container when running my scrapy command; if I do need to, I don't know how):

I tried to start my scraper like this: D:\Programs\Anaconda3\Scripts\scrapy.exe crawl D:\scrapy\my1stscraper\spiders\my1stscraper -o allobjects.json, which resulted in this error:

Traceback (most recent call last):
  File "D:\Programs\Anaconda3\Scripts\scrapy-script.py", line 6, in <module>
    from scrapy.cmdline import execute
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\__init__.py", line 12, in <module>
    from scrapy.spiders import Spider
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 11, in <module>
    from scrapy.http import Request
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\http\__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "D:\Programs\Anaconda3\lib\site-packages\scrapy\http\request\form.py", line 10, in <module>
    import lxml.html
  File "D:\Programs\Anaconda3\lib\site-packages\lxml\html\__init__.py", line 53, in <module>
    from .. import etree
ImportError: DLL load failed while importing etree: The specified module could not be found.

I looked here: from lxml import etree ImportError: DLL load failed: The specified module could not be found

That question discusses pip, which I am not using, but to be sure I did install the C++ build tools:

I still get the same error. How can I run my Scrapy scraper in a Docker container?
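One thing I suspect (an assumption on my part, not confirmed): the DLL error can occur when scrapy.exe is called by full path without the conda environment activated, so lxml's DLL dependencies in Anaconda's Library\bin are not on PATH. Also, scrapy crawl expects the spider's name, run from inside the project folder, not a path to the .py file. A sketch of what I could try, assuming the paths above:

```shell
# Sketch, assuming the Anaconda install at D:\Programs\Anaconda3 as above.
# Run from inside the project folder and pass the spider's *name*:
cd D:\scrapy\my1stscraper

# Either run this from an "Anaconda Prompt" (which activates base and puts
# Library\bin on PATH), or let conda set up the environment for one command:
conda run -n base scrapy crawl my1stscraper -o allobjects.json
```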

Update 1

My VPS is my only environment, so I am not sure how to test in a virtual environment.

What I have done now:

Looking at your suggestions:

Get steps to manually install the app on Windows Server - ideally test in a virtualised environment so you can reset it cleanly

  1. When you say the app, what do you mean? Scrapy? Conda?

Convert all steps to a fully automatic powershell script (e.g. for conda, need to download the installer via wget, execute the installer etc.

  1. I have now installed Conda on the host OS, since I figured that would give me the least overhead. Or should it go directly in the image - and if so, how do I avoid having to install it every time?

  2. Finally, just to be sure: I want to run multiple Scrapy scrapers, but with as little overhead as possible. Should I repeat the RUN command in the same docker container for every crawler I want to execute?

Update 2

whoami indeed returns user manager\containeradministrator

scrapy benchmark returns

Scrapy 2.4.1 - no active project
Unknown command: benchmark
Use "scrapy" to see available commands

I have the scrapy project I want to run in folder D:\scrapy\my1stscraper. How do I run that project, given that the D:\ drive is not available inside my container?

Update 3

A few months after we discussed this: when I now run the Dockerfile you proposed, it breaks and I get this output:

PS D:\Programs> docker build . -t scrapy
Sending build context to Docker daemon  1.644GB
Step 1/9 : FROM mcr.microsoft.com/windows/servercore:ltsc2019
 ---> d1724c2d9a84
Step 2/9 : SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]
 ---> Running in 5f79f1bf9b62
Removing intermediate container 5f79f1bf9b62
 ---> 8bb2a477eaca
Step 3/9 : RUN setx /M PATH $('C:\Users\ContainerAdministrator\miniconda3\Library\bin;C:\Users\ContainerAdministrator\miniconda3\Scripts;C:\Users\ContainerAdministrator\miniconda3;' + $Env:PATH)
 ---> Running in f3869c4f64d5

SUCCESS: Specified value was saved.
Removing intermediate container f3869c4f64d5
 ---> 82a2fa969a88
Step 4/9 : RUN Invoke-WebRequest "https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe" -OutFile miniconda3.exe -UseBasicParsing;     Start-Process -FilePath 'miniconda3.exe' -Wait -ArgumentList '/S', '/D=C:\Users\ContainerAdministrator\miniconda3';     Remove-Item .\miniconda3.exe;     conda install -y -c conda-forge scrapy;
 ---> Running in 3eb8b7bfe878
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.

Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with the existing python installation in your environment:

Specifications:

  - scrapy -> python[version='2.7.*|3.5.*|3.6.*|>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0|>=3.5,<3.6.0a0|3.4.*']

Your python: python=3.9

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

Not sure if I am reading this correctly, but it seems Scrapy does not support Python 3.9 - yet here I see "Scrapy requires Python 3.6+": https://docs.scrapy.org/en/latest/intro/install.html. Do you know what causes this issue? I also checked here, but that did not answer it either.
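If I read the solver output right, the conflict is likely because Miniconda3-latest now ships Python 3.9, while the scrapy build being solved only has builds up to Python 3.8. A sketch of a possible fix (an assumption on my part): pin python in the install step, or download a Python 3.8 Miniconda installer instead of "latest":

```shell
# Sketch: pin python so conda can pick a scrapy build that exists for that
# interpreter version; this would replace the conda install line in the
# Dockerfile's RUN step.
conda install -y -c conda-forge scrapy python=3.8
```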

To run a containerized app, it must first be installed in a container image - you do not want to install any software on the host machine.

For linux there are ready-made container images for pretty much everything, which is what your docker desktop environment uses; a docker hub search for scrapy turns up 1051 results, but none of them are windows containers.

The full process of creating a windows container for an app from scratch is:

  • Get the steps to manually install the app (scrapy and its dependencies) on Windows Server - ideally test in a virtualised environment so you can reset it cleanly
  • Convert all steps to a fully automatic powershell script (e.g. for conda, you need to download the installer via wget, execute the installer, etc.)
  • Optionally, test the powershell steps in an interactive container
    • docker run -it --isolation=process mcr.microsoft.com/windows/servercore:ltsc2019 powershell
    • this runs a windows container and gives you a shell to verify that your installation script works
    • when you exit the shell, the container stops
  • Create a Dockerfile
    • Use mcr.microsoft.com/windows/servercore:ltsc2019 as the base image via FROM
    • Use a RUN command for each line of your powershell script

I tried installing scrapy on top of an existing windows Dockerfile that used conda / python 3.6, and it threw the error SettingsFrame has no attribute 'ENABLE_CONNECT_PROTOCOL' at a similar stage.

But I tried again with miniconda and python 3.8, and was able to get scrapy running; this is the dockerfile:

FROM mcr.microsoft.com/windows/servercore:ltsc2019

SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]

RUN setx /M PATH $('C:\Users\ContainerAdministrator\miniconda3\Library\bin;C:\Users\ContainerAdministrator\miniconda3\Scripts;C:\Users\ContainerAdministrator\miniconda3;' + $Env:PATH)
RUN Invoke-WebRequest "https://repo.anaconda.com/miniconda/Miniconda3-py38_4.10.3-Windows-x86_64.exe" -OutFile miniconda3.exe -UseBasicParsing; \
    Start-Process -FilePath 'miniconda3.exe' -Wait -ArgumentList '/S', '/D=C:\Users\ContainerAdministrator\miniconda3'; \
    Remove-Item .\miniconda3.exe; \
    conda install -y -c conda-forge scrapy;

Build it with docker build . -t scrapy and run it with docker run -it scrapy.
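To run your own project rather than a bare shell, one approach (a sketch, assuming the image is tagged scrapy and your project lives in D:\scrapy\my1stscraper on the host) is to bind-mount the project folder into the container and run the spider by name. The same image can then serve any number of crawls, with no extra RUN steps per crawler:

```shell
# Bind-mount the host project into the container, make it the working
# directory, and run the spider by name; -o writes the output back to the
# host through the mount. Repeat with other spider names to run multiple
# scrapers from the same image.
docker run --rm -v D:\scrapy\my1stscraper:C:\my1stscraper -w C:\my1stscraper scrapy scrapy crawl my1stscraper -o allobjects.json
```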

To verify you are running a shell inside the container, run whoami - it should return user manager\containeradministrator.

Then the scrapy bench command should run and dump some benchmark stats. The container will stop when you close the shell.