How do I get the ngram string array and estfrequency as separate elements in a Hive table using HiveQL?

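I am analyzing my own tweets, and I have loaded the data into a Hive table using the Hive JSON SerDe. I want to find the frequency of every two-word phrase in my tweets, as a table. The output should look something like this: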

phrase             frequency
["the","room"]     1248.0
["a","boy"]        1039.0
["rt","to"]        1032.0
["to","ct"]        986.0

Right now I can do this for all one-word phrases, and the output I get is:

phrase     frequency
["the"]     1248.0
["a"]       1039.0
["rt"]      1032.0
["to"]      986.0
["you"]     828.0

My code for the one-word phrase output is:

create table ng(new_ar array<struct<ngram:array<string>,estfrequency:double>>);

INSERT OVERWRITE TABLE ng 
SELECT context_ngrams(sentences(lower(text)),array(null),100) as word 
FROM tweets;

create table wordFreq (ngram array<string>,  estfrequency double);

INSERT OVERWRITE TABLE wordFreq 
SELECT X.ngram, X.estfrequency 
FROM ng LATERAL VIEW explode(new_ar) Z as X;    

select * from wordFreq;

How do I modify the code above to get the output I want?

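To change your code from 1-grams to 2-grams, change array(null) to array(null,null) in the context_ngrams call. Each null in the context array is one blank word slot, so two nulls ask for the most frequent two-word sequences.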
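
Applied to the first INSERT from the question, that change might look like the sketch below. Everything except the second null is copied from the question. The ng table should not need a new definition, since context_ngrams still returns an array of (ngram, estfrequency) structs.

```sql
-- Sketch: the question's statement with two nulls in the context array,
-- so context_ngrams estimates the top 100 two-word phrases instead of
-- the top 100 single words.
INSERT OVERWRITE TABLE ng
SELECT context_ngrams(sentences(lower(text)), array(null, null), 100) AS word
FROM tweets;
```

Hive's built-in ngrams function, called as ngrams(sentences(lower(text)), 2, 100), should give the same kind of result if you do not need a context pattern.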

The modification below returns the two words in separate columns. You can concatenate them if you want a single phrase column.

create table wordFreq (word1 string, word2 string,  estfrequency double);

INSERT OVERWRITE TABLE wordFreq 
SELECT X.ngram[0],X.ngram[1], X.estfrequency 
FROM ng LATERAL VIEW explode(new_ar) Z as X;
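
If a single phrase column is preferred, the concatenation could be done with concat_ws, roughly as follows. This is a sketch only: phraseFreq is a hypothetical table name that does not appear in the question, and the space separator is an assumption about the desired format.

```sql
-- phraseFreq is a hypothetical table name chosen for this example.
CREATE TABLE phraseFreq (phrase string, estfrequency double);

-- Join the two words of each 2-gram with a single space.
INSERT OVERWRITE TABLE phraseFreq
SELECT concat_ws(' ', X.ngram[0], X.ngram[1]), X.estfrequency
FROM ng LATERAL VIEW explode(new_ar) Z AS X;

-- List the phrases, most frequent first.
SELECT * FROM phraseFreq ORDER BY estfrequency DESC;
```

Note that if the wordFreq table from the question still exists, the create table wordFreq statement above will fail until the old table is dropped, because its columns differ.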