ClickHouse approach for word frequency count on a textual field
I have a ClickHouse table in which one field contains a text description (~300 words).
For example, reviews:
Rev_id  Place_id  Stars  Category  Text
1       12        3      Food      Nice food but a bad dirty place.
2       31        4      Sport     Not bad, they have everything.
3       55        1      Bar       Poor place,bad audience.
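I would like to do some word-count analysis, such as a general word frequency count (how many times each word appears) or the top K words per category.
In the example: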
word count
bad 3
place 2
...
Is there a way to do this in ClickHouse alone, without involving a programming language?
SELECT
arrayJoin(splitByChar(' ', replaceRegexpAll(x, '[.,]', ' '))) AS w,
count()
FROM
(
SELECT 'Nice food but a bad dirty place.' AS x
UNION ALL
SELECT 'Not bad, they have everything.'
UNION ALL
SELECT 'Poor place,bad audience.'
)
GROUP BY w
ORDER BY count() DESC
┌─w──────────┬─count()─┐
│ │ 4 │
│ bad │ 3 │
│ place │ 2 │
│ have │ 1 │
│ Poor │ 1 │
│ food │ 1 │
│ Not │ 1 │
│ they │ 1 │
│ audience │ 1 │
│ Nice │ 1 │
│ but │ 1 │
│ dirty │ 1 │
│ a │ 1 │
│ everything │ 1 │
└────────────┴─────────┘
SELECT CATEGORY, ....
GROUP BY CATEGORY, w
If it fits your case, I would consider using alphaTokens as a more efficient approach.
SELECT
category,
arrayJoin(arrayFilter(x -> NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x), alphaTokens(text))) token,
count() count
FROM
(
/* test data */
SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
FROM
(
SELECT arrayJoin([
(1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
(4, 12, 3, 'Food', ' the the the the good food ..'),
(2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
(3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
)
)
GROUP BY category, token
ORDER BY count DESC
LIMIT 5;
/*
┌─category─┬─token────┬─count─┐
│ Food │ food │ 2 │
│ Food │ bad │ 1 │
│ Bar │ audience │ 1 │
│ Food │ Nice │ 1 │
│ Bar │ Poor │ 1 │
└──────────┴──────────┴───────┘
*/
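To see why alphaTokens needs no punctuation clean-up, it may help to compare the two tokenizers on a single string. The following is only a small sketch using one of the sample reviews; the column aliases `alpha` and `split` are made up for illustration. alphaTokens keeps runs of ASCII letters (a-z, A-Z) and treats everything else as a separator, so it never yields empty tokens, whereas the splitByChar variant emits an empty string wherever two separators are adjacent or the text ends with one. That is where the blank row with count 4 in the first answer comes from.

```sql
-- Compare both tokenizers on the same input (aliases are illustrative).
SELECT
    -- Runs of ASCII letters only: ['Not','bad','they','have','everything']
    alphaTokens('Not bad,,, they have everything.') AS alpha,
    -- Punctuation replaced by spaces, then split on single spaces:
    -- consecutive or trailing spaces produce empty-string tokens.
    splitByChar(' ', replaceRegexpAll('Not bad,,, they have everything.', '[.,]', ' ')) AS split;
```

Note that because alphaTokens only recognizes ASCII letters, digits and non-ASCII characters also act as separators, which may or may not be what you want for your texts.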
Example using topK:
SELECT
category,
arrayReduce('topK(3)',
arrayFilter(x -> (NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x)), groupArrayArray(alphaTokens(text)))) AS result
FROM
(
/* test data */
SELECT data.1 AS rev_id, data.2 AS place_id, data.3 AS stars, data.4 AS category, data.5 AS text
FROM
(
SELECT arrayJoin([
(1, 12, 3, 'Food', 'Nice food but a bad dirty place.'),
(4, 12, 3, 'Food', ' the the the the good food ..'),
(2, 31, 4, 'Sport', 'Not bad,,, they have everything.'),
(3, 55, 1, 'Bar', 'Poor place,bad audience..')]) AS data
)
)
GROUP BY category;
/* result
┌─category─┬─result─────────────────┐
│ Bar │ ['Poor','place','bad'] │
│ Food │ ['food','Nice','bad'] │
│ Sport │ ['Not','bad','they'] │
└──────────┴────────────────────────┘
*/
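topK returns an approximate list of the most frequent values and does not return the counts themselves. If exact counts per category are needed, one possible alternative (a sketch, not part of the answer above) is to count every token per category and then keep the first rows of each category with ClickHouse's LIMIT BY clause. The test data is trimmed to the two columns that matter, and the alias `cnt` and the tie-break by token name are assumptions made for this example.

```sql
SELECT
    category,
    -- Same stopword filter as above, one row per remaining token.
    arrayJoin(arrayFilter(x -> NOT has(['a', 'the', 'but' /*.. exclude stopwords */], x), alphaTokens(text))) AS token,
    count() AS cnt
FROM
(
    /* test data, reduced to category and text */
    SELECT data.1 AS category, data.2 AS text
    FROM
    (
        SELECT arrayJoin([
            ('Food', 'Nice food but a bad dirty place.'),
            ('Food', ' the the the the good food ..'),
            ('Sport', 'Not bad,,, they have everything.'),
            ('Bar', 'Poor place,bad audience..')]) AS data
    )
)
GROUP BY category, token
-- Sort so that the most frequent tokens come first within each category;
-- ties are broken by token name to make the result deterministic.
ORDER BY category ASC, cnt DESC, token ASC
-- Keep at most 3 rows for each category.
LIMIT 3 BY category;
```

This produces one row per (category, token) pair rather than one array per category, which is usually easier to join or filter further.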
PS: it probably makes sense to lower-case all strings/tokens before processing.
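A minimal sketch of that lower-casing step, assuming it is applied to the text before tokenizing. The three input strings are made up for illustration and are not from the question's table.

```sql
SELECT
    -- Lower-case first so that 'Bad', 'bad' and 'BAD' fall into one group.
    arrayJoin(alphaTokens(lower(text))) AS token,
    count() AS cnt
FROM
(
    /* illustrative data */
    SELECT arrayJoin(['Bad food', 'bad place', 'Not BAD']) AS text
)
GROUP BY token
ORDER BY cnt DESC, token ASC;
```

Here `bad` is counted 3 times, while `food`, `not` and `place` are counted once each. The stopword list then only needs lower-case entries. lower handles ASCII letters, which is sufficient here because alphaTokens only keeps ASCII letters anyway; with a different tokenizer and UTF-8 text, lowerUTF8 would be the function to look at.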