Error while trying to aggregate data using Apache Pig
This is the code I am running:
bigrams = LOAD 's3://******' AS (bigram:chararray, year:int, occurrences:int, books:int);
bg_tmp = filter bigrams BY (occurrences >= 300) AND (books >= 12);
bg_tmp_2 = GROUP bg_tmp ALL;
occ_cnt = FOREACH bg_tmp_2 GENERATE bigram, SUM(bg_tmp_2.occurrences);
x = LIMIT occ_cnt 100;
DUMP x;
This is the error I get when computing occ_cnt:
81201 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
18/10/26 16:05:07 ERROR grunt.Grunt: ERROR 1200: Pig script failed to parse: <line 5, column 48> Invalid scalar projection: bg_tmp_2
Details at logfile: /mnt/var/log/pig/pig_1540569826316.log
I don't know why this happens. I am using Apache Pig 0.17.0 and Hadoop 2.8.4 on AWS EMR.
I would rewrite your query as:
bg_tmp_2 = GROUP bg_tmp by (bigram);
occ_cnt = FOREACH bg_tmp_2 GENERATE group, SUM(bg_tmp.occurrences);
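I replaced GROUP ALL with GROUP BY bigram because I think you want the SUM for each bigram.
I replaced bg_tmp_2 with bg_tmp inside SUM because, within the bg_tmp_2 relation, the bag you want to reference is named bg_tmp.
(If you run "describe bg_tmp_2", you will see the following schema.)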
bg_tmp_2: {group: chararray,bg_tmp: {(bigram: chararray,year: int,occurrences: int,books: int)}}
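Putting the pieces together, the whole script might look like the sketch below. It is not tested against your data: the input path is a placeholder (the real one is elided in the question), and the aliases by_bigram, all_rows, grand_total and the field name total_occurrences are names I made up for illustration. It also shows that GROUP ... ALL still works if you really want a single grand total, as long as SUM projects from the inner bag bg_tmp and the GENERATE clause does not reference bigram, which is not a top-level field after grouping.

```pig
-- Placeholder path: substitute your own S3 location.
bigrams = LOAD 'input/bigrams.tsv'
          AS (bigram:chararray, year:int, occurrences:int, books:int);
bg_tmp  = FILTER bigrams BY (occurrences >= 300) AND (books >= 12);

-- Per-bigram totals: after GROUP ... BY, each tuple is (group, bg_tmp),
-- where bg_tmp is a bag holding the original rows for that bigram.
by_bigram = GROUP bg_tmp BY bigram;
occ_cnt   = FOREACH by_bigram
            GENERATE group AS bigram,
                     SUM(bg_tmp.occurrences) AS total_occurrences;

-- Grand total over all filtered rows: GROUP ... ALL yields a single tuple
-- whose bag is still named bg_tmp, so the projection is the same.
all_rows    = GROUP bg_tmp ALL;
grand_total = FOREACH all_rows
              GENERATE SUM(bg_tmp.occurrences) AS total_occurrences;

x = LIMIT occ_cnt 100;
DUMP x;
DUMP grand_total;
```

In both cases the argument to SUM is the bag named after the relation that was grouped (bg_tmp), not the alias of the grouped relation itself (bg_tmp_2 in the question), which is what the "Invalid scalar projection" message is complaining about.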