How to achieve UNION ALL in Pig?

I have 3 datasets, each containing 415 GB of data, and each belonging to a different domain.

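I need to union all of them using Pig, but the only option I have is its UNION clause, which seems to launch a reducer at the end of the job to remove duplicate values.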

a = UNION a1, a2;
data = UNION a, a3;

Is there a way to skip the reducer part, since the data is already distinct?

From the documentation on UNION:

Use the UNION operator to merge the contents of two or more relations. The UNION operator:

  • Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
  • Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.
  • Does not eliminate duplicate tuples.

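Emphasis mine. This suggests to me that no reducer step is needed to complete a UNION, because it does not have to remove duplicate rows. Are you sure the reducer job is the result of the UNION? It may be the result of another operator.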

Bonus: you can simplify your example to:

B = UNION a1, a2, a3 ;
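
One way to check which operator actually introduces the reduce phase is to ask Pig for the execution plan with EXPLAIN. The following is only a sketch: the input paths and the two-field schema are hypothetical placeholders, not taken from the question, so substitute your own.

```pig
-- Hypothetical input paths and schema; replace with your own.
a1 = LOAD '/data/domain1' AS (id:chararray, value:chararray);
a2 = LOAD '/data/domain2' AS (id:chararray, value:chararray);
a3 = LOAD '/data/domain3' AS (id:chararray, value:chararray);

-- UNION merges the three bags without removing duplicates.
data = UNION a1, a2, a3;

-- Print the logical, physical, and MapReduce plans for 'data'.
EXPLAIN data;
```

If the MapReduce plan for this script shows no reduce stage, the reducer in the real job most likely comes from a later statement such as DISTINCT, GROUP, ORDER BY, or JOIN, all of which need a shuffle.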