如何合并两个结果并将其通过管道传输到 apache-beam 管道中的下一步

How to combine two results and pipe it to next step in apache-beam pipeline

见下面的代码片段, 我希望 ["metric1", "metric2"] 作为 RunTask.process 的输入。然而,它是 运行 两次,分别是“metric1”和“metric2”

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) #let's say it returns ["metic1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) #let's say it returns ["metic2"]

  metric3 = (metric1, metric2) | beam.Flatten() | beam.ParDo(RunTask()) # I want ["metric1", "metric2"] to be my input for RunTask.process. However it was run twice with "metric1" and "metric2" respectively

  

我了解到您想以遵循以下语法的方式加入两个 PCollection:['element1','element2']。为了实现这一点,您可以使用 CoGroupByKey() instead of Flatten().

考虑到您的代码片段,语法为:

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) #let's say it returns ["metic1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) #let's say it returns ["metic2"]

  metric3 = (
       (metric1, metric2) 
       | beam.CoGroupByKey() 
       | beam.ParDo(RunTask()) 
 )

我想指出 Flatten() 和 CoGroupByKey() 之间的区别。

1) Flatten()接收两个或多个存储相同数据类型的PCollection,将它们合并为一个逻辑PCollection .例如,

import apache_beam as beam

from apache_beam import Flatten, Create, ParDo, Map

p = beam.Pipeline()

adress_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

street = p | 'CreateEmails' >> beam.Create(adress_list)
city = p | 'CreatePhones' >> beam.Create(city_list)

resul =(
    (street,city)
    |beam.Flatten()
    |ParDo(print)
)

p.run()

和输出,

('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')

请注意,两个 PCollections 都在输出中。但是,一个附加到另一个。

2) CoGroupByKey() 在具有相同键类型的两个或多个键值集合之间执行关系连接。使用此方法,您将执行按键连接,而不是像 Flatten() 中那样附加。下面是一个例子,

import apache_beam as beam

from apache_beam import Flatten, Create, ParDo, Map

p = beam.Pipeline()

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

street = p | 'CreateEmails' >> beam.Create(address_list)
city = p | 'CreatePhones' >> beam.Create(city_list)

results = (
    (street, city)
    | beam.CoGroupByKey()
    |ParDo(print)
    #| beam.io.WriteToText('delete.txt')
    
)

p.run()

和输出,

('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))

请注意,您需要 主键 才能加入结果。此外,此输出是您所期望的。

或者,使用侧输入:

metrics3 = metric1 | beam.ParDo(RunTask(), metric2=beam.pvalue.AsIter(metric2))

在RunTask进程()中:

def process(self, element_from_metric1, metric2):
  ...