How to find the average after joining two datasets and grouping in Pig
I have two datasets: EmployeeDetail, with 4 columns (id, name, gender, location), and SalaryDetail (id, salary). I joined the two datasets and grouped the result by location.
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by location;
DUMP GroupedByLocation gives me the expected result. Now, when I try to take the average using the line below,
AverageSalary = foreach GroupedByLocation generate group, AVG(SalaryDetail.salary);
it throws the following error.
<line 11, column 58> Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
I also tried the approach below, but got the same error.
AverageSalary = foreach GroupedByLocation {
    Sum = SUM(SalaryDetail.salary);
    Count = COUNT(SalaryDetail.salary);
    avgSal = Sum/Count;
    generate group as location, avgSal;
};
This time the error is:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
Can anyone tell me the correct way to do this?
Thanks to Sivasakthi Jayaraman for answering my question.
AverageSalary = foreach GroupedByLocation generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);
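This gives the average salary for each location.
Now I am trying to find the average salary for each gender within each location, so I tried grouping by gender inside the GroupedByLocation variable, but I am running into a problem.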
GroupdByGender = foreach GroupedByLocation {
    genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
    avgSalary = foreach genderGrp generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);
    generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};
I get this error:
Syntax error, unexpected symbol at or near 'JoinedEmpDetail'
Can anyone help?
You can't access the salary column like that. You first need to reference the JoinedEmpDetail relation, and then access the salary column through it.
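To see why the qualified name is needed, it can help to inspect the schemas. The following is a minimal sketch that reuses the relations from the question; the comments describe what I would expect describe to report, not captured output.

```pig
-- Same relations as in the question.
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by location;

-- After the join, every field carries the name of its source relation
-- as a prefix, e.g. EmpDetail::gender and SalaryDetail::salary.
describe JoinedEmpDetail;

-- After the group there should be two fields: `group` (the location)
-- and a bag named after the grouped relation, JoinedEmpDetail.
-- The salary values sit inside that bag, so they are reached as
-- JoinedEmpDetail.SalaryDetail::salary.
describe GroupedByLocation;
```

Since the grouped relation has no field called SalaryDetail, AVG(SalaryDetail.salary) is not handed a bag of salaries, which is presumably why Pig reports that it cannot infer a matching function.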
Can you try the statement below?
AverageSalary = foreach GroupedByLocation generate group, AVG(JoinedEmpDetail.SalaryDetail::salary);
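For the follow-up about the average salary per gender within each location: as far as I know, GROUP is not one of the operators allowed inside a nested foreach block, which would explain the syntax error. One possible alternative, sketched below, is to group on both keys at once. The aliases GroupedByLocGender and AvgByLocGender and the output field name avgSalary are made up for illustration.

```pig
-- Same relations as in the question.
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;

-- Group on the (location, gender) pair instead of nesting one group
-- inside another.
GroupedByLocGender = group JoinedEmpDetail by (EmpDetail::location, EmpDetail::gender);

-- `group` is now a tuple; FLATTEN turns it back into two columns.
AvgByLocGender = foreach GroupedByLocGender generate
    FLATTEN(group) as (location, gender),
    AVG(JoinedEmpDetail.SalaryDetail::salary) as avgSalary;

dump AvgByLocGender;
```

Each output row should then hold one location, one gender, and the average salary for that combination.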