NOT operator is expected to make 2 mutually exclusive groups
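In my script I read several files and use a regular expression and its complement to divide the records into 2 groups/classes. I expected two mutually exclusive classes, but the record counts do not add up.
So I added a SPLIT section to find the 'rest': records covered by neither my constraint nor its complement. The result is (again) not what I expected.
What is wrong with my script? Thanks for your help!
Expected 'math':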
input: 1464 records
outputs: 264 + 870 + ??? (330 missing)
Script:
A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);
Ac = foreach (GROUP A all) generate COUNT(A);
B = filter A by content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)';
Bc = foreach (GROUP B all) generate COUNT(B);
Bnot = filter A by NOT content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)';
Bcnot = foreach (GROUP Bnot all) generate COUNT(Bnot);
SPLIT A INTO SET1 IF (content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')
, SET2 IF (NOT content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')
, SETn OTHERWISE;
STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
STORE SETn into 'output/setn';
Result:
Input(s):
Successfully read 1464 records (49024 bytes) from: "hdfs://localhost:9000/user/dag/input/*"
Output(s):
Successfully stored 264 records (25276 bytes) in: "hdfs://localhost:9000/user/dag/output/set1"
Successfully stored 870 records (84190 bytes) in: "hdfs://localhost:9000/user/dag/output/set2"
Successfully stored 0 records in: "hdfs://localhost:9000/user/dag/output/setn"
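I assume that content is null in the 330 remaining cases (1464 - 264 - 870 = 330). In Pig, MATCHES applied to a null value evaluates to null, and NOT null is still null, so those records pass neither the filter nor its complement. That would also explain the empty SETn: OTHERWISE presumably behaves like the negation of the other conditions, which is null for these records as well.
If you replace the negated boolean expression with content is null OR NOT content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)', it should work.
That said, I don't find this very intuitive; I think Pig should throw a NullPointerException or at least log a warning.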
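A minimal sketch of how this could be checked and fixed, assuming the 330 missing records really do have a null content field. The load statement and the regular expression are copied from the question; the relation names Anull and Anullc are made up for illustration.

```pig
A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);

-- Diagnostic (hypothetical relation names): count the records whose content is null.
-- COUNT_STAR counts every tuple, including those with null fields.
-- If no record has a null content, the group is empty and DUMP prints nothing.
Anull  = filter A by content is null;
Anullc = foreach (GROUP Anull all) generate COUNT_STAR(Anull);
DUMP Anullc;

-- Fix: route null content explicitly into the complement set,
-- so that SET1 and SET2 together cover every record of A.
SPLIT A INTO SET1 IF (content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')
           , SET2 IF ((content is null) OR (NOT (content MATCHES '(^\b[BCDFMSTX].*\b\:\s{1}.*)')));

STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
```

If the assumption holds, the diagnostic should report 330 and the two stored sets should add up to the 1464 input records. The extra parentheses only make the intended precedence explicit.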