Getting exception while trying to execute a Pig Latin Script
I am teaching myself Pig and hit an exception while exploring a dataset. What is wrong with this script, and why?
movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE movies_data.title,movies_data.year,movies_data.rating,movies_data.duration;
DUMP high_rated;
The following error appears at the end of the MapReduce execution:
2018-07-22 20:11:07,213 [main] ERROR org.apache.pig.tools.grunt.Grunt
ERROR 1066: Unable to open iterator for alias high_rated.
Backend error : org.apache.pig.backend.executionengine.ExecException:
ERROR 0: Scalar has more than one row in the output.
1st : (1,The Nightmare Before Christmas,1993,3.9,4568.0),
2nd :(2,The Mummy,1932,3.5,4388.0)
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
First, let's look at how to fix the problem. You do not need to prefix fields with the relation alias. Your third line can simply be:
high_rated = FOREACH high GENERATE title, year, rating, duration;
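If for some reason you do want to qualify fields with the relation alias, use the disambiguation operator (::), as the hint in the error message suggests. Keep in mind that Pig prepends these prefixes to a schema after operators such as JOIN, COGROUP, CROSS, or FLATTEN, so on a relation that has only been filtered the qualified names may not resolve and the unqualified names above are the safer choice. The qualified form looks like this: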
high_rated = FOREACH high GENERATE movies_data::title, movies_data::year, movies_data::rating, movies_data::duration;
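To show where the :: operator is actually needed, here is a minimal sketch of a JOIN. The second dataset, its path '/movie_awards', and its fields are hypothetical and exist only for illustration; they are not part of the original question.

```pig
movies_data = LOAD '/movies_data' USING PigStorage(',') AS (id:chararray, title:chararray, year:int, rating:double, duration:double);

-- Hypothetical second input, assumed for this example only.
awards = LOAD '/movie_awards' USING PigStorage(',') AS (id:chararray, award:chararray);

-- After the JOIN, both inputs contribute a field named id,
-- so Pig prefixes the field names with the source alias.
joined = JOIN movies_data BY id, awards BY id;

-- :: picks the field from a specific input of the JOIN.
result = FOREACH joined GENERATE movies_data::id, movies_data::title, awards::award;
DUMP result;
```

Here the prefix is required for id because the name is ambiguous, while title and award could also be referenced without it.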
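Next, let's look at the exact cause of the error message. When you use the dot operator (.) on a relation alias to access a field, Pig treats the alias as a scalar, that is, a relation with exactly one row. Because movies_data has more than one row, the job fails at runtime. You can read more about scalars in Pig here: https://issues.apache.org/jira/browse/PIG-1434
At the end of the release notes section of that JIRA issue, the documented error message matches the one you received: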
If a relation contains more than single tuple, a runtime error is generated:
"Scalar has more than one row in the output"
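For contrast, here is a sketch of a case where the dot operator on a relation alias is legitimate, because the relation is guaranteed to hold a single row. The aliases all_movies, avg_rating, mean_rating, and with_diff are names chosen for this example, not taken from the question.

```pig
movies_data = LOAD '/movies_data' USING PigStorage(',') AS (id:chararray, title:chararray, year:int, rating:double, duration:double);

-- GROUP ALL puts every record into one group, so the aggregate
-- below produces exactly one row.
all_movies = GROUP movies_data ALL;

-- Inside this FOREACH, movies_data is a bag field of the group,
-- so movies_data.rating is an ordinary bag projection.
avg_rating = FOREACH all_movies GENERATE AVG(movies_data.rating) AS mean_rating;

-- avg_rating has a single row, so avg_rating.mean_rating is a valid scalar.
with_diff = FOREACH movies_data GENERATE title, rating - avg_rating.mean_rating AS diff;
DUMP with_diff;
```

If avg_rating ever held more than one row, this script would fail with the same "Scalar has more than one row in the output" error.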
This version runs without the error:
movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE title,year,rating,duration;
DUMP high_rated;
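FILTER passes through every record that satisfies the condition, with all of its columns, so the fields of high can be referenced directly by name.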