Pig: efficient filtering by loaded list

Question

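In Apache Pig (version 0.16.x), what are the most efficient ways to filter a dataset by an existing list of values for one of the dataset's fields?

For example (updated per @inquisitive_mind's hint):

Input: my_codes.txt, a line-separated file with one value per line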

'110'
'100'
'000'

sample_data.txt

'110', 2
'110', 3
'001', 3
'000', 1

Desired output

'110', 2
'110', 3
'000', 1

Example script

%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') 
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

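I also tried FILTER sample_data BY code IN my_codes; but the "IN" clause seems to require parentheses. I also tried FILTER sample_data BY code IN (my_codes); but got the error: A column needs to be projected from a relation for it to be used as a scalar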

Answer 1

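In your my_codes.txt file the codes are laid out as a row rather than a column. Since you are loading the file into a single field, the codes should be listed one per line, like this: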

'110'
'100'
'000'

Alternatively, you can use a JOIN:

joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0, $1;
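
Since the question is about efficiency, one variant worth trying is a fragment-replicate join, which Pig selects with USING 'replicated'. The sketch below assumes the code list is small enough to fit in memory and that sample_data.txt is comma-delimited; the PigStorage(',') loader, the DISTINCT step, and the unique_codes alias are illustrative additions, not part of the original script.

```pig
-- Sketch only: filter sample_data by a small loaded list using a replicated join.
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

my_codes = LOAD '$my_codes_file' AS (code:chararray);

-- Assumption: fields in sample_data.txt are separated by commas.
sample_data = LOAD '$sample_data_file' USING PigStorage(',')
              AS (code:chararray, point:float);

-- Remove duplicate codes so a repeated code cannot multiply matching rows.
unique_codes = DISTINCT my_codes;

-- With 'replicated', the large relation is listed first; the relation(s)
-- after it are held in memory, so they must be small.
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the columns that came from sample_data.
desired_data = FOREACH joined_data
               GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If the list of codes is short and fixed, the IN operator also works with literal values, for example FILTER sample_data BY code IN ('110', '100', '000'); it fails in the question because a relation was passed where constant values are expected.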
