Pig nested foreach in Spark 2.0
I am trying to convert a Pig script into a Spark 2 routine.
Within a groupBy, I want to count the elements that match a particular state. The Pig code looks like this:
A = foreach (group payment by customer) {
    done      = filter payment by state == 'done';
    doing     = filter payment by state == 'doing';
    cancelled = filter payment by state == 'cancelled';
    generate group as customer, COUNT(done) as nb_done, COUNT(doing) as nb_doing, COUNT(cancelled) as nb_cancelled;
};
I would like to adapt this to a DataFrame pipeline starting with payment.groupBy("customer").
Thanks!
Try something like this:
Assuming the customer table is registered in the Spark session with the following schema:
customer.registerTempTable("customer");
sparkSession.sql("describe customer").show();
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| string| null|
| state| string| null|
+--------+---------+-------+
// group by id, using a map for the conditional counts
sparkSession.sql("select id, count(state['done']) as done, " +
    "count(state['doing']) as doing, " +
    "count(state['cancelled']) as cancelled " +
    "from (select id, map(state, 1) as state from customer) t group by id").show();
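The map trick works because count() ignores nulls: map(state, 1)['done'] is non-null only for rows whose state is 'done'. The Pig nested foreach boils down to the same per-customer conditional count, which can be sketched without Spark in plain Java (the Payment record and sample rows here are hypothetical stand-ins for the payment relation):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PaymentCounts {
    // Hypothetical stand-in for one row of the payment relation.
    record Payment(String customer, String state) {}

    // Per-customer count of payments in each state -- the same result the
    // Pig nested foreach and the map-based SQL query produce.
    static Map<String, Map<String, Long>> countByState(List<Payment> payments) {
        return payments.stream().collect(
            Collectors.groupingBy(Payment::customer,
                Collectors.groupingBy(Payment::state, Collectors.counting())));
    }

    public static void main(String[] args) {
        List<Payment> payments = List.of(
            new Payment("c1", "done"), new Payment("c1", "done"),
            new Payment("c1", "doing"), new Payment("c2", "cancelled"));
        Map<String, Map<String, Long>> counts = countByState(payments);
        System.out.println(counts.get("c1").get("done"));      // 2
        System.out.println(counts.get("c2").get("cancelled")); // 1
    }
}
```

To stay in the DataFrame API as the question asks, one way is conditional aggregation with the built-in functions when/otherwise/sum (from org.apache.spark.sql.functions), e.g. payment.groupBy("customer").agg(sum(when(col("state").equalTo("done"), 1).otherwise(0)).as("nb_done"), ...) with one such aggregate per state.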