How to generate all n-grams in Hive
I want to create a list of n-grams using HiveQL. My idea was to use a regular expression with a lookahead and split - but this does not work:
select split('This is my sentence', '(\S+) +(?=(\S+))');
The input is one column of a table:
|sentence |
|-------------------------|
|This is my sentence |
|This is another sentence |
The output should be:
["This is","is my","my sentence"]
["This is","is another","another sentence"]
There is an n-gram UDF in Hive, but that function directly computes n-gram frequencies - I want a list of all n-grams instead.
Thanks in advance!
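(For comparison, the built-in `ngrams` UDAF mentioned above aggregates over rows and returns structs of n-gram and estimated frequency, not a per-row list; a sketch of its typical use, with hypothetical table name `source_table`:)

```sql
-- Top 10 bigrams with estimated frequencies across all rows -
-- a single aggregated result, not one array per sentence
select ngrams(sentences(lower(sentence)), 2, 10) from source_table;
```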
This is probably not the best, but quite a working solution. Split the sentences by a delimiter (in my example it is one or more spaces or commas), then explode and join to get n-grams, then assemble an array of n-grams using collect_set (if you need unique n-grams) or collect_list:
with src as
(
 select source_data.sentence, words.pos, words.word
   from
       (--Replace this subquery (source_data) with your table
        select stack (2,
                      'This is my sentence',
                      'This is another sentence'
                     ) as sentence
       ) source_data
       --split and explode words
       lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence, collect_set(concat_ws(' ', s1.word, s2.word)) as ngrams
  from src s1
       inner join src s2 on s1.sentence = s2.sentence and s1.pos + 1 = s2.pos
 group by s1.sentence;
Result:
OK
This is another sentence ["This is","is another","another sentence"]
This is my sentence ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
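The pairing logic above (split on spaces/commas, then join each word with the word at the next position) can be checked outside Hive; a minimal Python sketch of the same bigram construction:

```python
import re

def bigrams(sentence):
    # Split on one or more spaces or commas, mirroring split(sentence, '[ ,]+')
    words = [w for w in re.split(r'[ ,]+', sentence) if w]
    # Pair each word with its successor, mirroring the s1.pos + 1 = s2.pos join
    return [' '.join(pair) for pair in zip(words, words[1:])]

print(bigrams('This is my sentence'))       # ['This is', 'is my', 'my sentence']
print(bigrams('This is another sentence'))  # ['This is', 'is another', 'another sentence']
```

For trigrams, the same idea applies with one more self-join in Hive (`s1.pos + 2 = s3.pos`), or `zip(words, words[1:], words[2:])` in the sketch.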