What separators does the Hive ngrams UDF use to tokenize?

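I am doing some sentiment analysis.

I need to count the vocabulary (the distinct words) in a text.

The ngrams UDF seems to do a good job of determining unigrams. I would like to know which delimiters it uses to determine unigrams/tokens. This matters if I want to emulate the vocabulary count with the split UDF instead. For example, given the following text (a product review):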

I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.

The ngrams UDF counts 82 unigrams/tokens:

SELECT count(*) FROM 
(SELECT explode(ngrams(sentences(upper(reviewtext)),1,9999999))  
FROM  amazon.Food_review_part_small WHERE asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR') t;
82

However, the split UDF with space, comma, period, hyphen, and double quote as delimiters gives 85 unigrams/tokens:

select  count(distinct(te)) FROM amazon.Food_review_part_small 
lateral view explode(split(upper(reviewtext), '[\s,.-]|\"')) t as te
WHERE te <> '' AND asin = 'B0000CNU1X' AND reviewerid ='A1UCAVBNJUZMPR';
85

Of course, I can find hardly any documentation. Does anyone know which delimiters the ngrams UDF uses to determine unigram tokens?

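The ngrams UDAF does not split the data; it already expects an array of strings or an array of arrays of strings as input. In this case it is the sentences UDF that splits the data. From the comments in its Java source: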

+ "Unnecessary punctuation, such as periods and commas in English, is automatically stripped."
+ " If specified, 'lang' should be a two-letter ISO-639 language code (such as 'en'), and "
+ "'country' should be a two-letter ISO-3166 code (such as 'us'). Not all country and "
+ "language codes are fully supported, and if an unsupported code is specified, a default "

If you run the following query:

select sentences("I was aboslutely shocked to see how much 1 oz really was. At $7.60, I mistakenly assumed it would be a decent sized can. As locally I am able to buy a medium sized tube of wasabi paste for around $3, but never used it fast enough so it would get old. I figured a powder would be better, so I can mix it as I needed it. When I opened the box and dug thru the packing and saw this little little can, I started looking for the hidden cameras ... thought this HAD to be a joke. Nope .. and it's NOT returnable either. SO I HAVE LEARNED MY LESSON. Please just be aware if you should decide you want this EXPENSIVE wasabi powder.");

you get the following result:

[["I","was","aboslutely","shocked","to","see","how","much","1","oz","really","was"],["At","I","mistakenly","assumed","it","would","be","a","decent","sized","can"],["As","locally","I","am","able","to","buy","a","medium","sized","tube","of","wasabi","paste","for","around","but","never","used","it","fast","enough","so","it","would","get","old"],["I","figured","a","powder","would","be","better","so","I","can","mix","it","as","I","needed","it"],["When","I","opened","the","box","and","dug","thru","the","packing","and","saw","this","little","little","can","I","started","looking","for","the","hidden","cameras","thought","this","HAD","to","be","a","joke"],["Nope","and","it's","NOT","returnable","either"],["SO","I","HAVE","LEARNED","MY","LESSON"],["Please","just","be","aware","if","you","should","decide","you","want","this","EXPENSIVE","wasabi","powder"]]

As you can see, the sentences UDF also strips some "noise": tokens such as "$7.60" and "$3" are dropped from the output entirely.
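
To experiment with this kind of tokenization outside Hive, here is a minimal Java sketch. It rests on an assumption that the quoted comment does not confirm: that sentences relies on java.text.BreakIterator for both sentence and word boundaries and keeps only tokens whose first character is a letter or digit. The class name SentencesSketch and the method tokenize are hypothetical names chosen for illustration, and the exact boundaries can vary with the JDK version and locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentencesSketch {

    // Splits text into sentences, then each sentence into word tokens.
    // Assumed filter rule: keep a token only if its first character is a
    // letter or digit, so whitespace and punctuation tokens are discarded.
    public static List<List<String>> tokenize(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int sStart = sentenceIt.first();
        for (int sEnd = sentenceIt.next(); sEnd != BreakIterator.DONE;
                sStart = sEnd, sEnd = sentenceIt.next()) {
            String sentence = text.substring(sStart, sEnd);
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            List<String> words = new ArrayList<>();
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                if (Character.isLetterOrDigit(token.charAt(0))) {
                    words.add(token);
                }
            }
            result.add(words);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one inner list per detected sentence.
        System.out.println(tokenize("Hello, world. Bye now.", Locale.ENGLISH));
    }
}
```

If the assumption holds, the delimiters are not a fixed character set but whatever the locale's word-boundary rules produce. That would explain why a regex-based split gives a different count: the regex keeps any non-empty fragment between delimiters, while this approach discards every token that starts with punctuation or a symbol.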