Tokenize Function in Hive

Question

I am trying to follow this example, which computes term frequency and inverse document frequency in Hive: https://github.com/myui/hivemall/wiki/TFIDF-calculation

I have a table called pigoutputhive with the following fields:

The 'body' column contains a string of space-separated words [a-z A-Z & 0-9].

I want to tokenize body so that I can generate a relation of (owneruserid, word) tuples on which to run the TF-IDF algorithm.

I am getting an error related to the tokenize function. Can anyone tell me what I am doing wrong?

The error is as follows: Error while compiling statement: FAILED: SemanticException [Error 10011]: Line 8:37 Invalid function 'tokenize' [ERROR_STATUS]

create or replace view pigoutputhive_exploded
as
select
owneruserid, 
body,
score
from
pigoutputhive LATERAL VIEW explode(tokenize(body,true)) t as word
where
not is_stopword(word);

Answer 1

tokenize is not a built-in Hive function, so it does not work in plain Hive; use the built-in sentences() function instead.
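
As a rough sketch of that approach (not a drop-in replacement for the Hivemall example), the query below uses only built-in Hive functions. The aliases s, w, and sentence_words are illustrative names, and the table and column names are taken from the question. Since sentences() returns an array of sentences, each of which is an array of words, two LATERAL VIEW steps are needed to get one word per row.

```sql
-- sentences(body) returns array<array<string>>:
-- one inner array of words for each sentence found in body.
select
  owneruserid,
  word
from
  pigoutputhive
  -- first explode: one row per sentence (an array of words)
  LATERAL VIEW explode(sentences(body)) s as sentence_words
  -- second explode: one row per word
  LATERAL VIEW explode(sentence_words) w as word;
```

Note that is_stopword is also a Hivemall function, so the stop-word filter is left out here. Because body is already a space-separated list of words, explode(split(lower(body), ' ')) would probably work as well and needs only one LATERAL VIEW.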

Answer 2

The tokenize function is part of Hivemall, an extension library for Hive; it is not built into Hive itself.

So you need to install Hivemall first.

See the following page for how to load the Hivemall functions into Hive: https://github.com/myui/hivemall/wiki/Installation
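
In a Hive session, loading Hivemall typically looks something like the sketch below. Both paths are placeholders, not real locations: point them at wherever you put the Hivemall jar and the define-all.hive script that ships with it, and check the installation page for the exact file names for your version.

```sql
-- Placeholder paths (assumptions): adjust to your own Hivemall download.
add jar /path/to/hivemall-with-dependencies.jar;

-- define-all.hive registers the Hivemall functions
-- (tokenize, is_stopword, and others) as temporary functions.
source /path/to/define-all.hive;

-- Quick check that the function is now known to this session.
describe function tokenize;
```

These are temporary functions, so they must be loaded again in each new session (or from your .hiverc) before the create view statement from the question will compile. For the TF-IDF steps that follow, the view will also need word in its select list, since the later queries work on (owneruserid, word) pairs rather than on the whole body.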

hive

tokenize

tf-idf