Order of Apache Pig Transformations

I am reading Programming Pig by Alan Gates.

Consider this code:

ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS 
    (userID:int, movieID:int, rating:int, ratingTime:int);

metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS 
    (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);

nameLookup = FOREACH metadata GENERATE 
    movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;

nameLookupYear = FOREACH nameLookup GENERATE 
    movieID, movieTitle, GetYear(releaseYear) AS finalYear;

filterMovies = FILTER nameLookupYear BY finalYear < 1982;

groupedMovies = GROUP filterMovies BY finalYear;

orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER metadata by finalYear DESC;
    GENERATE GROUP, finalYear;
    };

DUMP orderedMovies;

The book states:

"Sorting by maps, tuples or bags produces error".

I would like to know how to sort the grouped results.

Do the transformations need to follow a specific order to work?

If you want to sort the values inside each group, you have to use a nested foreach. This sorts the years in descending order within each group.

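orderedMovies = FOREACH groupedMovies {
      -- the bag inside each group is named after the grouped relation (filterMovies),
      -- and a nested ORDER sorts by a field of that bag, not by an expression
      sortOrder = ORDER filterMovies BY finalYear DESC;
      GENERATE group, FLATTEN(sortOrder.(movieID, movieTitle));
};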

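Since you are trying to sort the grouped results themselves, you do not need a nested foreach. You would use a nested foreach if, for example, you wanted to sort the movies within each year by title or release date. Just order the relation as usual (refer to finalYear as group, because you grouped by finalYear on the previous line):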

orderedMovies = ORDER groupedMovies BY group ASC;

DUMP orderedMovies;
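
If both kinds of sorting are wanted, the two approaches can be combined. The following is only a sketch, assuming the groupedMovies and filterMovies relations from the question; the aliases sortedWithinYear, byTitle, movies, and orderedByYear are hypothetical names introduced here for illustration. The nested foreach sorts the movies inside each year's bag by title, and the top-level ORDER then sorts the groups by year. Pig does not generally promise bag ordering, so treat the inner order as something to verify on your own data.

```pig
-- Sort the movies inside each year's bag by title (nested FOREACH).
sortedWithinYear = FOREACH groupedMovies {
    byTitle = ORDER filterMovies BY movieTitle ASC;
    -- rename the group key back to finalYear so it is easier to refer to later
    GENERATE group AS finalYear, byTitle AS movies;
};

-- Sort the groups themselves by year (top-level ORDER on a scalar field).
orderedByYear = ORDER sortedWithinYear BY finalYear ASC;

DUMP orderedByYear;
```

The top-level ORDER works here because finalYear is a simple int. The sentence quoted from the book applies when the sort key itself is a map, tuple, or bag.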