How to generate all n-grams in Hive

I want to create a list of n-grams using HiveQL. My idea was to use a regular expression with a lookahead and split - but this doesn't work:

select split('This is my sentence', '(\S+) +(?=(\S+))');
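The lookahead idea itself is sound in regex engines that support it; the catch is that split discards the captured groups. A minimal Python sketch (used here only to illustrate the pattern, not as a Hive solution) gets the overlapping pairs with findall instead of split:

```python
import re

def bigrams(sentence):
    # A lookahead captures each word together with the word that follows it,
    # so consecutive pairs overlap; the consumed \S+ advances past one word
    # at a time, and the last word (with no successor) yields no pair.
    return re.findall(r'(?=(\S+ \S+))\S+', sentence)

print(bigrams('This is my sentence'))  # → ['This is', 'is my', 'my sentence']
```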

The input is a table with one column:
|sentence                 |
|-------------------------|
|This is my sentence      |
|This is another sentence |

The output should be:

["This is","is my","my sentence"]
["This is","is another","another sentence"]

There is an n-gram UDF in Hive, but that function computes n-gram frequencies directly - I want a list of all the n-grams.

Thanks in advance!

This may not be the best, but it is quite an effective solution. Split the sentence by a delimiter (in my example it is one or more spaces or commas), then explode with positions and join to get the n-grams, then assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list:

with src as 
(
select source_data.sentence, words.pos, words.word
  from
      (--Replace this subquery (source_data) with your table
       select stack (2,
                     'This is my sentence', 
                     'This is another sentence'
                     ) as sentence
      ) source_data 
        --split and explode words
        lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)

select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams 
      from src s1 
           inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos              
group by s1.sentence;

Result:

OK
This is another sentence        ["This is","is another","another sentence"]
This is my sentence             ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
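The same position-join logic is easy to check outside Hive. This Python sketch (illustrative only, not part of the Hive query) mirrors posexplode plus the s1.pos + 1 = s2.pos self-join, and generalizes to any n - in Hive, each additional word per n-gram would correspond to one more self-join (s1.pos + 2 = s3.pos, and so on):

```python
def ngrams(sentence, n=2):
    # "posexplode": split into words, keeping each word's position implicitly
    # as its list index; filter out empty strings from repeated delimiters.
    words = [w for w in sentence.split() if w]
    # Join word i with the n-1 words that follow it, like the pos+1 self-join.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams('This is my sentence'))       # → ['This is', 'is my', 'my sentence']
print(ngrams('This is my sentence', n=3))  # → ['This is my', 'is my sentence']
```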