Training spaCy model as a Vertex AI Pipeline "Component"

Question

I am trying to train a spaCy model while turning the code into a Vertex AI Pipeline Component. My current code is:

@component(
    packages_to_install=[
        "setuptools",
        "wheel", 
        "spacy[cuda113,transformers,lookups]",
    ],
    base_image="gcr.io/deeplearning-platform-release/base-cu113",
    output_component_file="train.yaml"
)
def train(train_name: str, dev_name: str) -> NamedTuple("output", [("model_path", str)]):
    """
    Trains a spacy model
    
    Parameters:
    ----------
    train_name : Name of the spaCy "train" set, used for model training.
    dev_name : Name of the spaCy "dev" set, used for evaluation during training.
    
    Returns:
    -------
    output : Destination path of the saved model.
    """
    import spacy
    import subprocess
    
    spacy.require_gpu()  # <=== IMAGE FAILS TO BE COMPILED HERE
    
    # NOTE: The remaining code has already been tested and proven to be functional.
    #       It has been edited since the project is private.
    
    # Presets for training
    subprocess.run(["python", "-m", "spacy", "init", "fill-config", "gcs/secret_path_to_config/base_config.cfg", "config.cfg"])

    # Training model
    location = "gcs/secret_model_destination_path/TestModel"
    subprocess.run(["python", "-m", "spacy", "train", "config.cfg",
                    "--output", location,
                    "--paths.train", "gcs/secret_bucket/secret_path/{}.spacy".format(train_name),
                    "--paths.dev", "gcs/secret_bucket/secret_path/{}.spacy".format(dev_name),
                    "--gpu-id", "0"])
    
    return (location,)

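The Vertex AI logs show the following as the main cause of the failure:

The libraries install successfully, but I suspect some library or setting is missing (going by my own experience); however, I don't know how to make it "compatible with Python-based Vertex AI Components". By the way, GPU usage is mandatory in my code.

Any ideas?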

Answer 1

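Delete the failing line, i.e. spacy.require_gpu()  # <=== IMAGE FAILS TO BE COMPILED HERE

Also remove the CUDA extra, cuda113, from the install line.

Your code is set up to use a GPU, but for a learning exercise you don't need one. Neither of us knows how to request a GPU-enabled instance for a Python-based Vertex AI component on GCP, so drop the GPU requirement for now. Once the code runs, you can go back and add the GPU.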
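
One way to follow this advice without keeping two versions of the code is to make the GPU optional. The sketch below is only an illustration: `build_train_command` and `run_training` are hypothetical helpers, and the file names are placeholders. It relies on the `spacy train` CLI accepting `--gpu-id -1` to mean "train on CPU".

```python
import subprocess
import sys
from typing import List


def build_train_command(config_path: str, output_dir: str,
                        train_path: str, dev_path: str,
                        use_gpu: bool) -> List[str]:
    """Build the `spacy train` command line (hypothetical helper)."""
    gpu_id = "0" if use_gpu else "-1"  # -1 tells spaCy to train on CPU
    return [
        sys.executable, "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
        "--gpu-id", gpu_id,
    ]


def run_training(command: List[str]) -> None:
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


# Example: a CPU-only command for a first test run (paths are placeholders).
cpu_command = build_train_command(
    "config.cfg", "output", "train.spacy", "dev.spacy", use_gpu=False
)
print(cpu_command[-2:])
```

Passing `check=True` is a deliberate choice here: without it, a failed `spacy train` call would go unnoticed and the component would still report success.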

Answer 2

OK, first make sure the CUDA 11.3 toolkit is installed in your Google Cloud environment. You can do that with the following commands:

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian10/x86_64/ /"
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-11-3

# optional
python -m spacy download en_core_web_trf

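Install the other pip packages and dependencies:

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Point to the correct CUDA folder:

export CUDA_PATH="/usr/local/cuda-11"

Install spaCy with the transformers extra:

pip install -U spacy[cuda113,transformers]

CuPy for CUDA 11.3 can also be installed directly:

pip install cupy-cuda113

Now, if the libraries and packages installed correctly, run this: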

>>> import spacy
>>> spacy.require_gpu()
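
If raising an exception is not wanted, a similar check can be written so that it only reports the result. This is a minimal sketch: it assumes `spacy.prefer_gpu()` returns `True` when a GPU was activated and `False` otherwise, and the `gpu_status` helper with its return strings is made up for illustration.

```python
def gpu_status() -> str:
    """Report whether spaCy can use a GPU, without raising."""
    try:
        import spacy
    except ImportError:
        return "spacy-not-installed"
    # prefer_gpu() activates the GPU if one is usable and returns True;
    # otherwise it returns False and spaCy stays on CPU.
    return "gpu" if spacy.prefer_gpu() else "cpu"


print(gpu_status())
```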

Answer 3

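After some trial runs, I think I have figured out what my code was missing. The train component definition was actually correct (with some minor tweaks relative to what was originally posted); the pipeline, however, was missing the GPU definition. I will first include a dummy code example that trains an NER model with spaCy and orchestrates everything through a Vertex AI Pipeline: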

from kfp.v2 import compiler
from kfp.v2.dsl import pipeline, component, Dataset, Input, Output, OutputPath, InputPath
from datetime import datetime
from google.cloud import aiplatform
from typing import NamedTuple


# Component definition

@component(
    packages_to_install=[
        "setuptools",
        "wheel", 
        "spacy[cuda113,transformers,lookups]",
    ],
    base_image="gcr.io/deeplearning-platform-release/base-cu113",
    output_component_file="generate.yaml"
)
def generate_spacy_file(train_path: OutputPath(), dev_path: OutputPath()):
    """
    Generates a small, dummy 'train.spacy' & 'dev.spacy' file
    
    Returns:
    -------
    train_path : Relative location in GCS, for the "train.spacy" file.
    dev_path: Relative location in GCS, for the "dev.spacy" file.
    """
    import spacy
    from spacy.training import Example
    from spacy.tokens import DocBin

    td = [    # Train (dummy) dataset, in spaCy v2 format: (text, {"entities": [(start, end, label)]})
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0, 6, "PRODUCT")]}),
              ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(15, 20, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
    ]
    
    dd = [    # Development (dummy) dataset (CV), in spaCy v2 format
              ("Flipkart started it's journey from zero", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
              ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]})
    ]

    
    # Converting Train & Development datasets, from 'spaCy V2' to 'spaCy V3'
    nlp = spacy.blank("en")
    db_train = DocBin()
    db_dev = DocBin()

    for text, annotations in td:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        db_train.add(example.reference)
        
    for text, annotations in dd:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        db_dev.add(example.reference)
    
    db_train.to_disk(train_path + ".spacy")  # <== Obtaining and storing "train.spacy"
    db_dev.to_disk(dev_path + ".spacy")      # <== Obtaining and storing "dev.spacy"
    

# ----------------------- ORIGINALLY POSTED CODE -----------------------

@component(
    packages_to_install=[
        "setuptools",
        "wheel", 
        "spacy[cuda113,transformers,lookups]",
    ],
    base_image="gcr.io/deeplearning-platform-release/base-cu113",
    output_component_file="train.yaml"
)
def train(train_path: InputPath(), dev_path: InputPath(), output_path: OutputPath()):
    """
    Trains a spacy model
    
    Parameters:
    ----------
    train_path : Relative location in GCS, for the "train.spacy" file.
    dev_path: Relative location in GCS, for the "dev.spacy" file.
    
    Returns:
    -------
    output : Destination path of the saved model.
    """
    import spacy
    import subprocess
    
    spacy.require_gpu()  # <=== IMAGE NOW MANAGES TO GET BUILT!

    # Presets for training
    subprocess.run(["python", "-m", "spacy", "init", "fill-config", "gcs/secret_path_to_config/base_config.cfg", "config.cfg"])

    # Training model
    subprocess.run(["python", "-m", "spacy", "train", "config.cfg",
                    "--output", output_path,
                    "--paths.train", "{}.spacy".format(train_path),
                    "--paths.dev", "{}.spacy".format(dev_path),
                    "--gpu-id", "0"])

# ----------------------------------------------------------------------
    

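# Pipeline definition

# Placeholder value (assumption): set this to your own GCS pipeline root.
PIPELINE_ROOT = "gs://your-bucket/pipeline_root"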

@pipeline(
    pipeline_root=PIPELINE_ROOT,
    name="spacy-dummy-pipeline",
)
def spacy_pipeline():
    """
    Builds a custom pipeline
    """
    # Generating dummy "train.spacy" + "dev.spacy"
    train_dev_sets = generate_spacy_file()
    # With the output of the previous component, train a spaCy model
    model = train(
        train_dev_sets.outputs["train_path"],
        train_dev_sets.outputs["dev_path"]
    
    # ------ !!! THIS SECTION DOES THE TRICK !!! ------
    ).add_node_selector_constraint(
        label_name="cloud.google.com/gke-accelerator",
        value="NVIDIA_TESLA_T4"
    ).set_gpu_limit(1).set_memory_limit('32G')
    # -------------------------------------------------

# Pipeline compilation   

compiler.Compiler().compile(
    pipeline_func=spacy_pipeline, package_path="pipeline_spacy_job.json"
)


# Pipeline run

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

run = aiplatform.PipelineJob(  # Include your own naming here
    display_name="spacy-dummy-pipeline",
    template_path="pipeline_spacy_job.json",
    job_id="ml-pipeline-spacydummy-small-{0}".format(TIMESTAMP),
    parameter_values={},
    enable_caching=True,
)


# Pipeline gets submitted

run.submit()

Now, the explanation. According to Google:

By default, the component will run on as a Vertex AI CustomJob using an e2-standard-4 machine, with 4 core CPUs and 16GB memory.

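So, when the train component ran, it failed because "it did not see any GPU available as a resource"; however, the same page lists all the available CPU and GPU settings. As you can see, in my case I set the train component to run on one (1) NVIDIA_TESLA_T4 GPU card, and I also increased the memory to 32 GB. With these modifications, the resulting pipeline looks like this: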

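As you can see, it compiles successfully, and it trains (and eventually produces) a functional spaCy model. From here, you can adapt this code to your own needs.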
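
When adapting the dummy data, keep in mind that character offsets such as `(0, 7, "ORG")` are easy to get wrong when typed by hand. A small sketch of one way to compute them instead: `entity_span` is a hypothetical helper (not a spaCy function), and it assumes the first occurrence of the entity text is the one to annotate.

```python
from typing import Tuple


def entity_span(text: str, entity: str, label: str) -> Tuple[int, int, str]:
    """Return (start, end, label) for the first occurrence of `entity` in `text`."""
    start = text.find(entity)
    if start == -1:
        raise ValueError(f"{entity!r} not found in {text!r}")
    return (start, start + len(entity), label)


text = "Walmart is a leading e-commerce company"
span = entity_span(text, "Walmart", "ORG")
print(span)  # (0, 7, 'ORG')

# The result plugs straight into the (text, annotations) format used above.
example = (text, {"entities": [span]})
```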

I hope this helps anyone who might be interested.

Thank you.

python

google-cloud-platform

spacy-transformers

spacy-3

google-cloud-vertex-ai