Calculate percentage on boolean column
Suppose my data has the following structure:
Year | Location | New_client
2018 | Paris | true
2018 | Paris | true
2018 | Paris | false
2018 | London | true
2018 | Madrid | true
2018 | Madrid | false
2017 | Paris | true
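I am trying to calculate the percentage of true values of New_client for each Year and Location, so for the sample records above the expected output would be: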
2018 | Paris | 66
2018 | London | 100
2018 | Madrid | 50
2017 | Paris | 100
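Here is my current script, which I adapted from elsewhere; the difference is that I need it to work on 2 columns (Year and Location) instead of 1: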
inpt = load...
grp = group inpt by Year; -- creates one bag per distinct Year value
result = FOREACH grp {
total = COUNT(inpt);
t = FILTER inpt BY New_client == 'true'; -- bag containing only the rows where New_client is true
GENERATE FLATTEN(group) AS Year, total AS TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
The problem is that this uses only Year as the grouping key, whereas I need it to be Year and Location.
Thanks for your help.
You need to group by both Year and Location, which requires two changes. First, add Location to the group by statement. Second, change FLATTEN(group) AS Year to FLATTEN(group) AS (Year, Location), because group is now a tuple with two fields.
grp = group inpt by (Year, Location);
result = FOREACH grp {
total = COUNT(inpt);
t = FILTER inpt BY New_client == 'true';
GENERATE
FLATTEN(group) AS (Year, Location),
total AS TOTAL_ROWS_IN_INPUT_TABLE,
100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
I tested this code, and it seems to work for me:
A = LOAD ...
B = GROUP A BY (year, location);
C = FOREACH B {
TRUE_CNT = FILTER A BY (chararray)new_client == 'true';
GENERATE group.year, group.location, (int)((float)COUNT(TRUE_CNT) / COUNT(A) * 100);
};
DUMP C;
(2017,Paris,100)
(2018,Paris,66)
(2018,London,100)
(2018,Madrid,50)
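As a sanity check on the numbers above, the same group-by-(Year, Location) computation can be sketched in plain Python over the sample rows from the question. This is only a cross-check, not a replacement for the Pig script: the function name percent_true and the in-memory rows list are made up for illustration, and New_client is assumed to be already parsed into a Python bool. Integer division is used so that the result truncates the way the (int) cast does in the Pig version.

```python
from collections import defaultdict

# Sample rows from the question: (Year, Location, New_client).
# Assumption: New_client has already been parsed into a bool.
rows = [
    (2018, "Paris", True),
    (2018, "Paris", True),
    (2018, "Paris", False),
    (2018, "London", True),
    (2018, "Madrid", True),
    (2018, "Madrid", False),
    (2017, "Paris", True),
]


def percent_true(rows):
    """Return {(year, location): truncated integer percentage of True flags}."""
    # key -> [true_count, total_count]
    counts = defaultdict(lambda: [0, 0])
    for year, location, new_client in rows:
        key = (year, location)
        counts[key][1] += 1
        if new_client:
            counts[key][0] += 1
    # Integer division truncates, like the (int) cast in the Pig script.
    return {key: 100 * true_n // total_n for key, (true_n, total_n) in counts.items()}


result = percent_true(rows)
for (year, location), pct in sorted(result.items()):
    print(year, location, pct)
```

The key point mirrored from the Pig answers is that the grouping key is the pair (Year, Location) rather than Year alone, so each city in each year gets its own count of true rows and total rows.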