Min/max group-wise and filter without join in Pig
I am trying to compute (max+min)/2 for each group. Here is my schema:
UrlXpathsCount: {url: chararray,leafpathstr: chararray,urlpath_count: long}
I group it by the url field:
byUrl = GROUP UrlXpathsCount by url;
I am trying to find (max+min)/2 as follows:
midRangeByUrl = FOREACH byUrl {
    urls_desc = order UrlXpathsCount by urlpath_count desc;
    urls_max = limit urls_desc 1;
    urls_asc = order UrlXpathsCount by urlpath_count asc;
    urls_min = limit urls_asc 1;
    GENERATE FLATTEN(urls_max), FLATTEN(urls_min);
};
Here is the schema of midRangeByUrl:
midRangeByUrl: {urls_max::url: chararray,urls_max::leafpathstr: chararray,urls_max::urlpath_count: long,urls_min::url: chararray,urls_min::leafpathstr: chararray,urls_min::urlpath_count: long}
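The problem I am facing is that generating FLATTEN(group), FLATTEN(urls_max), FLATTEN(urls_min) gives me many combinations that I do not want.
I want (max+min)/2 for each group.
To get it, I project the urlpath_count of the max and min tuples as follows: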
computeMidRange = FOREACH midRangeByUrl generate urls_max::url as mid_url,((DOUBLE)urls_max::urlpath_count+(DOUBLE) urls_min::urlpath_count)/2 as midRange;
I then join the two relations as follows:
/* Join computeMidRange and UrlXpathsCount */
midRangeJoin = join UrlXpathsCount by url , computeMidRange by mid_url using 'replicated';
midRangeOut = FOREACH midRangeJoin GENERATE UrlXpathsCount::url as url,UrlXpathsCount::leafpathstr as leafpathstr,
UrlXpathsCount::urlpath_count as urlpath_count,computeMidRange::midRange as midRange;
Then I apply the filter:
templates = FILTER midRangeOut by urlpath_count > midRange;
I would like to avoid midRangeJoin by somehow computing midRangeByUrl so that it projects the fields url, urlpath_count, leafpathstr and (min+max)/2 without a join.
Please help me solve this.
Thanks
You can use the built-in MAX and MIN UDFs instead:
UrlXpathsCount = load 'your_data' using PigStorage(',') as (url: chararray,leafpathstr: chararray,urlpath_count: long);
B = GROUP UrlXpathsCount by url;
C = foreach B generate group as url, MAX(UrlXpathsCount.urlpath_count) as max_count,
MIN(UrlXpathsCount.urlpath_count) as min_count;
D = foreach C generate url, ((double)max_count + (double)min_count)/2 as val;
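This should do what you are asking for, with no nested foreach or join. I split the calculation into C and D to avoid one very long line, but you could also do it in a single statement. Remember to cast the values to double: your urlpath_count is a long, so without the cast the division is integer division and you will not get any decimals.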
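If you also need the original fields (url, leafpathstr, urlpath_count) next to the mid-range so that you can filter on it, one possible sketch is to FLATTEN the grouped bag in the same GENERATE that computes the value. The relation name withMidRange is only illustrative, and the input path and delimiter are the same placeholders as in the LOAD above; I have not run this against your data.

```pig
-- Placeholder input, same assumptions as the LOAD statement above.
UrlXpathsCount = LOAD 'your_data' USING PigStorage(',')
    AS (url: chararray, leafpathstr: chararray, urlpath_count: long);

byUrl = GROUP UrlXpathsCount BY url;

-- FLATTEN re-emits every original row of the group. MAX and MIN each return
-- a single value per group, so midRange is repeated next to every row and
-- no join back to UrlXpathsCount is needed.
withMidRange = FOREACH byUrl GENERATE
    FLATTEN(UrlXpathsCount),
    ((double)MAX(UrlXpathsCount.urlpath_count)
      + (double)MIN(UrlXpathsCount.urlpath_count)) / 2 AS midRange;

-- Keep only the rows whose count is above the mid-range of their url.
templates = FILTER withMidRange BY urlpath_count > midRange;
```

Because group is not generated here, the flattened field names stay unambiguous, so urlpath_count can be referenced without the UrlXpathsCount:: prefix. Flattening one bag next to scalar values also avoids the unwanted combinations you get from flattening two bags side by side.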