How to achieve UNION ALL in Pig?
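I have 3 datasets, each holding 415 GB of data and belonging to a different domain.
I need to union all of them using Pig, but the only option I can find is the UNION clause, which appears to launch a reducer at the end of the job to remove duplicate values.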
a = UNION a1, a2;
data = UNION a, a3;
Is there a way to skip the reducer part, since the data is already distinct?
From the documentation on UNION:
Use the UNION operator to merge the contents of two or more relations. The UNION operator:
- Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
- Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.
- Does not eliminate duplicate tuples.
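The emphasis on the last point is mine. This suggests to me that no reducer step is needed to complete a UNION, because it does not have to remove duplicate rows. Are you sure the reduce job is a result of the UNION? It may be the result of another operator.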
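One way to check where the reduce phase comes from is Pig's EXPLAIN operator, which prints the execution plans for an alias. The following is a minimal sketch, not the asker's actual script: the input paths and the alias `deduped` are hypothetical, and the default loader is assumed.

```pig
-- Hypothetical input locations; replace with the real paths and loaders.
a1 = LOAD '/data/domain1';
a2 = LOAD '/data/domain2';
a3 = LOAD '/data/domain3';

-- UNION only concatenates the bags; it does not remove duplicates.
data = UNION a1, a2, a3;

-- Print the execution plans for the union alone.
EXPLAIN data;

-- For comparison: DISTINCT has to bring identical tuples together,
-- so its plan includes a reduce phase.
deduped = DISTINCT data;
EXPLAIN deduped;
```

If the MapReduce plan printed for `data` contains no reduce step, the reducer seen in the real job most likely comes from another operator in the script, such as DISTINCT, GROUP, ORDER, or JOIN.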
Bonus: you can simplify your example to:
B = UNION a1, a2, a3;