NOT operator is supposed to make 2 mutually exclusive groups

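In my script I read multiple files and use a regular expression and its complement to divide the records into 2 groups/classes. I expected two mutually exclusive classes, but when I counted the records the numbers did not add up. So I added a SPLIT section to find the 'rest': the records covered neither by my constraint nor by its complement. The result is (again) not what I expected. What is wrong with my script? Thanks for your help!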

Expected 'math':

 input: 1464 records
 outputs: 264 + 870 + ???_330__??

Script:

A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);
Ac = foreach (GROUP A all) generate COUNT(A);

B = filter A by content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)';
Bc = foreach (GROUP B all) generate COUNT(B);

Bnot = filter A by NOT content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)';
Bcnot = foreach (GROUP Bnot all) generate COUNT(Bnot);

SPLIT A INTO SET1 IF (content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')
              , SET2 IF (NOT content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')
              , SETn OTHERWISE;

STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
STORE SETn into 'output/setn';

Result:

 Input(s):
 Successfully read 1464 records (49024 bytes) from: "hdfs://localhost:9000/user/dag/input/*"

 Output(s):
 Successfully stored 264 records (25276 bytes) in: "hdfs://localhost:9000/user/dag/output/set1"
 Successfully stored 870 records (84190 bytes) in: "hdfs://localhost:9000/user/dag/output/set2"
 Successfully stored 0 records in: "hdfs://localhost:9000/user/dag/output/setn"

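I assume that content is null in the 330 missing cases (1464 - 264 - 870 = 330). In Pig, MATCHES against a null yields null, NOT null is still null, and FILTER and SPLIT only pass records whose condition evaluates to true, so those records end up in neither group. If you replace the boolean expression with content is null OR NOT content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)', it should work.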
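A minimal sketch of that change, reusing the relation A and the regular expression from the question. The names Anull and Anullc and the output path 'output/set2' are only illustrative, and I have not run this against your data, so treat it as a starting point rather than a verified fix:

```pig
A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);

-- Check the assumption first: how many records have a null content field?
-- COUNT_STAR counts every tuple in the bag, including ones with null fields.
Anull = filter A by content is null;
Anullc = foreach (GROUP Anull all) generate COUNT_STAR(Anull);
DUMP Anullc;

-- Route null content explicitly into the complement group.
SPLIT A INTO SET1 IF (content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')
           , SET2 IF ((content is null) OR (NOT (content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')));

STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
```

If the assumption holds, Anullc should report 330 and SET2 should grow from 870 to 1200 records, so that 264 + 1200 = 1464.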

That said, I don't find this very intuitive. I think Pig should throw a NullPointerException or at least log a warning.