ngrams 单词组合 hive
ngrams combination of words hive
我有一个包含 1000 行和 3 个变量(一个 ID、一个国家/地区和一个字符串变量 "VAR1")的 table。 VAR1 是由 space 分隔的单词组成的句子。
我想要按国家/地区统计所有成对(或所有三胞胎)的单词数。非常重要的一对(或三胞胎)是所有单词的交叉(不一定是一步一步)。也许,我们可以用 ngrams hql 函数来做,但是当我使用它时,它是一步一步地计算单词而不是所有的交叉。
让我们举个例子,让您了解我想要什么:
> **"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA" "bird lion car"
VAR1不一定是3个字的长度。只是为了简化。
我想要一个 2-ngrams 的 4 个步骤的结果:
第 1 步:最重要的一步:交叉单词 2
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "lion bird" 1
> "3" "USA" "lion car" 1
> "3" "USA" "car bird" 1
> "3" "USA" "car lion" 1
第 2 步:订购 2 克
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "car lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "car lion" 1
第 3 步:按 ID、国家/地区、2-ngrams 区分
> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA" "bird lion"
> "3" "USA" "bird car"
> "3" "USA" "car lion"
第 4 步:按国家/地区计数,2-ngrams
> "CANADA" "dad mum" 2
> "CANADA" "child dad" 1
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1
> "CANADA" "dog mum" 1
> "USA" "bird lion" 1
> "USA" "bird car" 1
> "USA" "car lion" 1
非常感谢你
with cte as
(
select t.ID
,t.COUNTRY
,pe.pos
,pe.val
from mytable t
lateral view posexplode (split(VAR1,'\s+')) pe
)
select t1.COUNTRY
,concat_ws(' ',t1.val,t2.val) as combination
,count (*) as cnt
from cte t1
join cte t2
on t2.id =
t1.id
where t1.pos < t2.pos
group by t1.COUNTRY
,t1.val
,t2.val
;
+----------+--------------+------+
| country | combination | cnt |
+----------+--------------+------+
| CANADA | dad child | 1 |
| CANADA | dad dog | 1 |
| CANADA | dad mum | 2 |
| CANADA | mum child | 1 |
| CANADA | mum dog | 1 |
| USA | bird car | 1 |
| USA | bird lion | 1 |
| USA | lion car | 1 |
+----------+--------------+------+
我有一个包含 1000 行和 3 个变量(一个 ID、一个国家/地区和一个字符串变量 "VAR1")的 table。 VAR1 是由 space 分隔的单词组成的句子。
我想要按国家/地区统计所有成对(或所有三胞胎)的单词数。非常重要的一对(或三胞胎)是所有单词的交叉(不一定是一步一步)。也许,我们可以用 ngrams hql 函数来做,但是当我使用它时,它是一步一步地计算单词而不是所有的交叉。
让我们举个例子,让您了解我想要什么:
> **"ID" "COUNTRY" "VAR1"**
> "1" "CANADA" "dad mum child"
> "2" "CANADA" "dad mum dog"
> "3" "USA" "bird lion car"
VAR1不一定是3个字的长度。只是为了简化。
我想要一个 2-ngrams 的 4 个步骤的结果:
第 1 步:最重要的一步:交叉单词 2
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "dad child" 1
> "1" "CANADA" "mum dad" 1
> "1" "CANADA" "mum child" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "mum dad" 1
> "2" "CANADA" "mum dog" 1
> "2" "CANADA" "dog dad" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "lion bird" 1
> "3" "USA" "lion car" 1
> "3" "USA" "car bird" 1
> "3" "USA" "car lion" 1
第 2 步:订购 2 克
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "dad mum" 1
> "1" "CANADA" "child mum" 1
> "1" "CANADA" "child dad" 1
> "1" "CANADA" "child mum" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dad mum" 1
> "2" "CANADA" "dog mum" 1
> "2" "CANADA" "dad dog" 1
> "2" "CANADA" "dog mum" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "bird lion" 1
> "3" "USA" "car lion" 1
> "3" "USA" "bird car" 1
> "3" "USA" "car lion" 1
第 3 步:按 ID、国家/地区、2-ngrams 区分
> "1" "CANADA" "dad mum"
> "1" "CANADA" "child dad"
> "1" "CANADA" "child mum"
> "2" "CANADA" "dad mum"
> "2" "CANADA" "dad dog"
> "2" "CANADA" "dog mum"
> "3" "USA" "bird lion"
> "3" "USA" "bird car"
> "3" "USA" "car lion"
第 4 步:按国家/地区计数,2-ngrams
> "CANADA" "dad mum" 2
> "CANADA" "child dad" 1
> "CANADA" "child mum" 1
> "CANADA" "dad dog" 1
> "CANADA" "dog mum" 1
> "USA" "bird lion" 1
> "USA" "bird car" 1
> "USA" "car lion" 1
非常感谢你
with cte as
(
select t.ID
,t.COUNTRY
,pe.pos
,pe.val
from mytable t
lateral view posexplode (split(VAR1,'\s+')) pe
)
select t1.COUNTRY
,concat_ws(' ',t1.val,t2.val) as combination
,count (*) as cnt
from cte t1
join cte t2
on t2.id =
t1.id
where t1.pos < t2.pos
group by t1.COUNTRY
,t1.val
,t2.val
;
+----------+--------------+------+
| country | combination | cnt |
+----------+--------------+------+
| CANADA | dad child | 1 |
| CANADA | dad dog | 1 |
| CANADA | dad mum | 2 |
| CANADA | mum child | 1 |
| CANADA | mum dog | 1 |
| USA | bird car | 1 |
| USA | bird lion | 1 |
| USA | lion car | 1 |
+----------+--------------+------+