What's the effective way to count rows in Pig?

What's an efficient way to get a count in Pig? We can do a GROUP ALL, but that gives us only 1 reducer. When the data size is very large, say n TB, can we somehow use multiple reducers?

  dataCount = FOREACH (GROUP data ALL) GENERATE
    'count' AS metric,
    COUNT(data) AS value;

I dug into this topic a bit more, and it seems that if you use an up-to-date Pig version you don't have to worry about a single reducer having to process a huge amount of data. Algebraic UDFs handle COUNT smartly: the counting is done on the mappers, so the reducer only has to process the aggregated data (one count per mapper). I believe this was introduced in 0.9.1, but 0.14.0 certainly has it:

Algebraic Interface

An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions algebraic. COUNT is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer.
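As a toy illustration of that map-side counting (plain Python, not Pig internals; the split names and rows here are made up), the partial computation looks like this:

```python
# Toy simulation of an algebraic COUNT on Hadoop: each mapper/combiner
# emits one partial count for its input split, and the single reducer
# only sums those partials instead of scanning every row.
splits = {
    "split-0": ["a;1", "a;2"],         # handled by mapper 0
    "split-1": ["b;3", "b;4", "b;5"],  # handled by mapper 1
}

# Map/combine phase: one partial count per split.
partial_counts = [len(rows) for rows in splits.values()]

# Reduce phase: sum the partials -- the reducer sees 2 small
# numbers here, not 5 rows.
total = sum(partial_counts)
print(partial_counts, total)  # -> [2, 3] 5
```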

But my earlier answer was definitely wrong:

In the grouping you can use the PARALLEL n keyword to set the number of reducers:

Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).

Instead of using GROUP ALL directly, split it into two steps. First, group by some field and count the rows per group. Then do a GROUP ALL to sum all of those counts. This way you can count the rows in parallel.

Note, however, that if the field you use in the first GROUP BY has no duplicates, the resulting counts will all be 1 and there will be no difference. Try to use a field with many duplicates to improve performance.

See this example:

a;1
a;2
b;3
b;4
b;5

If we first group by the first field, which has duplicates, the final COUNT will process 2 rows instead of 5:

A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)

But if we group by the second field, which is unique, it will process 5 rows:

A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
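The effect of the two-step approach can also be sketched outside Pig (plain Python here, using the same 5-row sample, just to show how few intermediate rows the final sum has to touch):

```python
# Two-step row count: per-key counts first (parallel in Pig),
# then a sum over those counts. Data mirrors the 'data' file above.
from collections import Counter

rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]

# Step 1: GROUP BY the first field and COUNT each group.
counts_by_first = Counter(key for key, _ in rows)  # {'a': 2, 'b': 3}

# Step 2: GROUP ALL + SUM over the per-group counts. With the
# duplicate-heavy first field the sum touches only 2 intermediate
# rows; grouping by the unique second field would leave 5 rows of 1.
total = sum(counts_by_first.values())
print(len(counts_by_first), total)  # -> 2 5
```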