生成子字符串的计数 (Apache Pig)
Generate a count of a substring (Apache Pig)
我正在尝试计算子字符串的特定部分。我的 A 是正确的,但我无法正确使用 B。我包含了实验室的评论以帮助解释代码的某些部分。
data = LOAD '/dualcore/orders' AS (order_id:int,
cust_id:int,
order_dtm:chararray);
/*
* Include only records where the 'order_dtm' field matches
* the regular expression pattern:
*
* ^ = beginning of string
* 2013 = literal value '2013'
* 0[2345] = 0 followed by 2, 3, 4, or 5
* - = a literal character '-'
* \d{2} = exactly two digits
* \s = a single whitespace character
* .* = any number of any characters
* $ = end of string
*
* If you are not familiar with regular expressions and would
* like to know more about them, see the Regular Expression
* Reference at the end of the Exercise Manual.
*/
recent = FILTER data by order_dtm matches '^2013-0[2345]-\d{2}\s.*$';
-- TODO (A): Create a new relation with just the order's year and month
A = FOREACH data GENERATE SUBSTRING(order_dtm,0,7);
-- TODO (B): Count the number of orders in each month
B = FOREACH data GENERATE COUNT_STAR(A);
-- TODO (C): Display the count by month to the screen.
DUMP C;'
您可以通过两种方式解决这个问题。
选项 1:使用您提到的 SUBSTRING
输入
1 100 2013-02-15 test
2 100 2013-04-20 test1
1 101 2013-02-14 test2
1 101 2014-02-27 test3
PigScript:
data = LOAD 'input' AS (order_id:int,cust_id:int,order_dtm:chararray);
recent = FILTER data by order_dtm matches '^2013-0[2345]-\d{2}\s.*$';
A = FOREACH recent GENERATE order_id,cust_id,SUBSTRING(order_dtm,0,4) AS year,SUBSTRING(order_dtm,5,7) AS month;
B = GROUP A BY month;
C = FOREACH B GENERATE group AS month,FLATTEN(A.year) AS year,COUNT(A) AS cnt;
DUMP C;
输出:
(02,2013,2)
(02,2013,2)
(04,2013,1)
选项 2:使用 REGEX 函数
data = LOAD 'input' AS(order_id:int,cust_id:int,order_dtm:chararray);
A = FOREACH data GENERATE order_id,cust_id,FLATTEN(REGEX_EXTRACT_ALL(order_dtm,'^(2013)-(0[2345])-\d{2}\s.*$')) AS (year,month);
B = FILTER A BY month IS NOT NULL;
C = GROUP B BY month;
D = FOREACH C GENERATE group AS month,FLATTEN(B.year) AS year,COUNT(B) AS cnt;
DUMP D;
输出:
(02,2013,2)
(02,2013,2)
(04,2013,1)
在这两种情况下,我都在最终输出中包含了年份,以防万一,如果您不想要,请从脚本中删除 FLATTEN(year)
。
我正在尝试计算子字符串的特定部分。我的 A 是正确的,但我无法正确使用 B。我包含了实验室的评论以帮助解释代码的某些部分。
data = LOAD '/dualcore/orders' AS (order_id:int,
cust_id:int,
order_dtm:chararray);
/*
* Include only records where the 'order_dtm' field matches
* the regular expression pattern:
*
* ^ = beginning of string
* 2013 = literal value '2013'
* 0[2345] = 0 followed by 2, 3, 4, or 5
* - = a literal character '-'
* \d{2} = exactly two digits
* \s = a single whitespace character
* .* = any number of any characters
* $ = end of string
*
* If you are not familiar with regular expressions and would
* like to know more about them, see the Regular Expression
* Reference at the end of the Exercise Manual.
*/
recent = FILTER data by order_dtm matches '^2013-0[2345]-\d{2}\s.*$';
-- TODO (A): Create a new relation with just the order's year and month
A = FOREACH data GENERATE SUBSTRING(order_dtm,0,7);
-- TODO (B): Count the number of orders in each month
B = FOREACH data GENERATE COUNT_STAR(A);
-- TODO (C): Display the count by month to the screen.
DUMP C;'
您可以通过两种方式解决这个问题。
选项 1:使用您提到的 SUBSTRING
输入
1 100 2013-02-15 test
2 100 2013-04-20 test1
1 101 2013-02-14 test2
1 101 2014-02-27 test3
PigScript:
data = LOAD 'input' AS (order_id:int,cust_id:int,order_dtm:chararray);
recent = FILTER data by order_dtm matches '^2013-0[2345]-\d{2}\s.*$';
A = FOREACH recent GENERATE order_id,cust_id,SUBSTRING(order_dtm,0,4) AS year,SUBSTRING(order_dtm,5,7) AS month;
B = GROUP A BY month;
C = FOREACH B GENERATE group AS month,FLATTEN(A.year) AS year,COUNT(A) AS cnt;
DUMP C;
输出:
(02,2013,2)
(02,2013,2)
(04,2013,1)
选项 2:使用 REGEX 函数
data = LOAD 'input' AS(order_id:int,cust_id:int,order_dtm:chararray);
A = FOREACH data GENERATE order_id,cust_id,FLATTEN(REGEX_EXTRACT_ALL(order_dtm,'^(2013)-(0[2345])-\d{2}\s.*$')) AS (year,month);
B = FILTER A BY month IS NOT NULL;
C = GROUP B BY month;
D = FOREACH C GENERATE group AS month,FLATTEN(B.year) AS year,COUNT(B) AS cnt;
DUMP D;
输出:
(02,2013,2)
(02,2013,2)
(04,2013,1)
在这两种情况下,我都在最终输出中包含了年份,以防万一,如果您不想要,请从脚本中删除 FLATTEN(year)
。