Word count in Pig
Suppose I have a text file named count.txt that contains the following paragraph:
I am working in hadoop along with various courses like Hadoop, Hana, Java etc
I love working with hadoop
This is hadoop project
Now I need to find out how many times the word hadoop occurs in the file above.
Below is the code I have tried:
c1= load '/...../count.txt' using PigStorage(',') as (Name:chararray);
c2 = foreach c1 generate FLATTEN(TOKENIZE(LOWER(Name)))as (Name1:chararray);
dump c2;
c3 = filter c2 by Name1=='hadoop';
dump c3;
The output is:
(hadoop)
(hadoop)
(hadoop)
(hadoop)
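What I need is the number 4, not the word hadoop repeated 4 times. So I tried executing
`c4 = foreach c3 generate COUNT($0);`
and got an error. Please help me; this is probably something simple that I am missing.
Thanks in advance.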
Try this:
Just do a group after filtering c2:
c3 = filter c2 by Name1=='hadoop';
grouped = GROUP c3 BY Name1;
-- $0 is the group key ('hadoop'); $1 is the bag of matching tuples that COUNT operates on
wordcount = FOREACH grouped GENERATE $0, COUNT($1);
DUMP wordcount;
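As for why the original attempt failed: COUNT expects a bag, and passing it a field of c3 directly hands it a plain chararray, so Pig rejects the statement. If you only want the bare number without the word next to it, a minimal sketch, assuming c3 is the filtered relation from the question, is to collect everything into a single bag with GROUP ... ALL. The alias names all_rows and hadoop_count are only illustrative.

```pig
-- Assumes c3 is the filtered relation from the question:
-- one ('hadoop') tuple per occurrence of the word.
all_rows = GROUP c3 ALL;                              -- a single group whose bag holds every tuple of c3
hadoop_count = FOREACH all_rows GENERATE COUNT(c3);   -- COUNT takes a bag; the bag is named after the grouped relation
DUMP hadoop_count;
```

With the four tuples shown in the question this should print a single tuple containing 4. Note that COUNT skips tuples whose first field is null, which does not matter here because the filter only keeps non-null matches.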
Let me know if this helps.