How to integrate Cloud Composer with Compute Engine

Hi everyone, I'm new to GCP (moving over from AWS), so please forgive a naive question. We are building a traditional EDW on GCP. We use Cloud Composer as our scheduler, and all of our code lives on Compute Engine (like EC2 instances in AWS).

How do I set up a workflow to run my jobs on Compute Engine? Or what would be the best solution to achieve the same thing?

Some more information about our pipelines:

Pipeline 1: extract millions of rows from a SQL database (legacy), apply some ETL logic (cleaning, adding new columns, dropping columns, uppercasing column values, etc.), and finally load into Redshift.

Pipeline 2: read data from Google Sheets, apply the same ETL logic as above, and load into different Redshift table(s).

Pipeline 3: read data from a Google API, clean it, insert into Redshift, etc.
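For illustration, the ETL logic these pipelines share (cleaning, adding and dropping columns, uppercasing values) can be sketched in pandas; all column names below are made-up examples, not the real schema:

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the shared ETL steps: clean, add/drop columns, uppercase values."""
    df = df.copy()
    # Cleaning: drop fully-empty rows and strip whitespace from string columns
    df = df.dropna(how="all")
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Add a new derived column (hypothetical example)
    df["load_date"] = pd.Timestamp.now(tz="UTC").date().isoformat()
    # Drop a column not needed downstream (hypothetical name)
    df = df.drop(columns=["legacy_id"], errors="ignore")
    # Uppercase the values of a column (hypothetical name)
    if "country" in df.columns:
        df["country"] = df["country"].str.upper()
    return df

raw = pd.DataFrame({"country": [" us ", "de", None], "legacy_id": [1, 2, 3]})
clean = transform(raw)
```

The same `transform` function could be reused by all three pipelines, with only the extract and load steps differing.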

What is the best way to write my ETL workflows with Cloud Composer?

Any help is much appreciated!

----------PROJECT STRUCTURE & REQUIREMENTS------------
On my Compute Engine instance, my projects look like this:

    /home/ubuntu/projects/project1
        /venv
        /src/job1.py ( Reads Google Sheets and loads into cloudsql)
        /src/job2.py ( Reads the Google AdWords API, does some cleaning, modifies attributes, and loads into cloudsql)


    /home/ubuntu/projects/project2
        /venv
        /src/job1.py ( Reads a file from GCS, performs cleaning, adding/removing columns, and loads into cloudsql)
        /src/job2.py ( Reads data from cloudsql table A, performs some modifications, and loads into cloudsql table B)
    
    
    
    
    Now, in Composer, how do I orchestrate the complete workflow? The Python jobs sit on Compute Engine and I need to execute them.
    
    The reason why we use Compute Engine is to perform some in-memory operations, like reading data into a DataFrame, doing group-bys, creating new columns, creating temporary files, and so on.
    
    Or what would you suggest? For example, moving the whole sandbox into Composer's /data directory, like:
    /data/projects/project1
        /venv
        /src/job1.py ( Reads Google Sheets and loads into cloudsql)
        /src/job2.py ( Reads the Google AdWords API, does some cleaning, modifies attributes, and loads into cloudsql)
    
    
    /data/projects/project2
        /venv
        /src/job1.py ( Reads a file from GCS, performs cleaning, adding/removing columns, and loads into cloudsql)
        /src/job2.py ( Reads data from cloudsql table A, performs some modifications, and loads into cloudsql table B)
    
    
    In this case:
        1. Will I be able to download temporary files onto the Composer workers and perform some operations on them?
        2. I won't need to create a venv if I place my code in Composer directly, since I can install packages via PyPI in the console, right?

----------------------------------------------------------

Could you share your expertise on this? Thanks in advance!

Here is a design pattern you can adapt to your needs: Task scheduling on Compute Engine with Cloud Scheduler

Assuming you can set up Pub/Sub topics and subscriptions, you can...

  • Have a DAG in Composer that runs some code and publishes a message to a Pub/Sub topic
  • Have a process running on Compute Engine that is subscribed to that topic. When it receives a message, it triggers the script you need to run.
  • When the script finishes, notify another Pub/Sub topic
  • Have a separate DAG in Composer that is triggered when that message arrives (note: there are several ways to do this; see here).
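To make the pattern concrete, here is a minimal sketch of the message contract and the VM-side worker. The project ID, topic/subscription names, and message schema are all assumptions (the paths mirror the layout in your question); the google-cloud-pubsub calls only run under `__main__`:

```python
import json
import subprocess

# All names below are assumptions -- adjust for your environment.
PROJECT_ID = "my-gcp-project"              # hypothetical GCP project
DONE_TOPIC = "etl-run-results"             # topic the VM notifies on completion
RUN_SUBSCRIPTION = "etl-run-requests-sub"  # subscription the VM listens on

def build_message(project: str, job: str) -> bytes:
    """Composer side: the payload a DAG task publishes to the run topic."""
    return json.dumps({"project": project, "job": job}).encode("utf-8")

def script_path(project: str, job: str) -> str:
    """VM side: map a message to the script it should execute."""
    return f"/home/ubuntu/projects/{project}/src/{job}.py"

def venv_python(project: str) -> str:
    """Each project runs under its own virtualenv interpreter."""
    return f"/home/ubuntu/projects/{project}/venv/bin/python"

def run_job(data: bytes) -> int:
    """VM side: run the requested job and return its exit code."""
    msg = json.loads(data.decode("utf-8"))
    cmd = [venv_python(msg["project"]), script_path(msg["project"], msg["job"])]
    return subprocess.call(cmd)

if __name__ == "__main__":
    # Long-running subscriber process on the Compute Engine VM.
    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    subscriber = pubsub_v1.SubscriberClient()
    publisher = pubsub_v1.PublisherClient()

    def callback(message):
        exit_code = run_job(message.data)
        # Notify completion so a second DAG in Composer can pick it up.
        publisher.publish(
            publisher.topic_path(PROJECT_ID, DONE_TOPIC),
            data=message.data,
            status=str(exit_code),
        )
        message.ack()

    future = subscriber.subscribe(
        subscriber.subscription_path(PROJECT_ID, RUN_SUBSCRIPTION),
        callback=callback,
    )
    future.result()  # blocks until the process is stopped
```

On the Composer side, a PythonOperator task could call `build_message(...)` and publish it with the same client, or you could use the Google provider's `PubSubPublishMessageOperator` instead.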