Order of Apache Pig Transformations
I am reading Programming Pig by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It reports:
"Sorting by maps, tuples or bags produces error".
I would like to know how to sort the grouped results.
Do the transformations need to follow a particular order to work?
If you want to sort the grouped values, you have to use a nested foreach. This sorts the years in descending order within each group.
orderedMovies = FOREACH groupedMovies {
    -- inside the group, the bag is named after the grouped relation (filterMovies)
    sortOrder = ORDER filterMovies BY finalYear DESC;
    GENERATE group, sortOrder.(movieID, movieTitle);
};
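Since you are trying to sort the grouped results themselves, you don't need a nested foreach. You would use a nested foreach if, for example, you wanted to sort the movies within each year by title or release date. Just order as usual (refer to finalYear as group, since you grouped by finalYear on the previous line):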
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
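For reference, here is a minimal sketch of the whole pipeline with both kinds of sorting in place. It assumes the same path and schema as the question and that every releaseDate parses with the 'dd-MMM-yyyy' pattern. The aliases sortedWithin and byTitle are made up for illustration, and sorting by movieTitle inside each year is only an example of a per-group sort.

```pig
-- Load the item metadata with the schema used in the question.
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|') AS
    (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink:chararray);

-- Parse the release date and extract the year in a single FOREACH.
nameLookupYear = FOREACH metadata GENERATE
    movieID, movieTitle,
    GetYear(ToDate(releaseDate, 'dd-MMM-yyyy')) AS finalYear;

filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;

-- Optional: sort the movies inside each year's bag (hypothetical aliases).
sortedWithin = FOREACH groupedMovies {
    byTitle = ORDER filterMovies BY movieTitle ASC;
    GENERATE group AS finalYear, byTitle;
};

-- Sort the groups themselves by the scalar key, not by the bag.
orderedMovies = ORDER sortedWithin BY finalYear ASC;
DUMP orderedMovies;
```

The top-level ORDER works here because finalYear is a plain int. Ordering by a bag, tuple or map field is what triggers the error quoted in the question.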