Pig - Store a complex relation schema in a Hive table
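Here is my problem for today. After reading a relation from Hive, I created a new relation as the result of several transformations. I want to store that final relation back in Hive once the analysis is done, but I can't. Let me show my code to make this clearer.
First, this is where I load from Hive and transform my result: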
july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- Keep only the day of the month, the start station and the trip duration.
july_cl = FOREACH july GENERATE GetDay(ToDate(start_date)) as day:int,start_station,duration;
jul_cl_fl = FILTER july_cl BY day==31;
july_gr = GROUP jul_cl_fl BY (day,start_station);
-- Aggregate per (day, start_station) group.
july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group),total_dura,avg_dura,qty_trips;
};
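So now, when I try to store the relation july_result, I can't, because the schema has changed and I suppose it is no longer compatible with Hive:
STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();
Even when I tried to set an explicit schema for the final relation, I couldn't get it to work: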
july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group) as (day:int),total_dura as (total_dura:int),avg_dura as (avg_dura:int),qty_trips as (qty_trips:int);
};
After some research in the Hortonworks community, I found out how to define the output format for a grouped relation in Pig. My new code looks like this:
july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group) AS (day, code_station),(int)total_dura as (total_dura:int),(float)avg_dura as (avg_dura:float),(int)qty_trips as (qty_trips:int);
};
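For anyone hitting the same thing, here is a minimal sketch of how the result can be checked and then stored. My understanding of why the casts matter, assuming duration is an int column: Pig's SUM and COUNT return long and AVG returns double, so without the casts the field types would not line up with int and float columns on the Hive side. The sketch also assumes that poc.july_analysis already exists in Hive with columns named day, code_station, total_dura, avg_dura and qty_trips of compatible types, and that Pig was started with HCatalog support (for example with pig -useHCatalog).

```pig
-- Print the schema Pig has inferred for the final relation, so it can be
-- compared field by field with the Hive table definition.
DESCRIBE july_result;

-- Assumption: poc.july_analysis was created in Hive beforehand, with column
-- names and types matching the schema printed above.
STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

If the STORE still fails, comparing the DESCRIBE output with the table definition in Hive should show which field name or type differs.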
Thanks, everyone.