Unigrams in SQL

Question

I have the following table (tbl):

id | text
1  | example text
2  | this is an example text
3  | text text text

I want to return this table as output (unigrams):

ngram   | counts | n_ids
text    |  5     | 3
example |  2     | 2
this    |  1     | 1
is      |  1     | 1
an      |  1     | 1

I thought of using a cross join to solve this (I am using Presto).

WITH 
  ngram_array AS (
  SELECT id, ngrams(split(text, ' ')) ngram_array FROM tbl
 ),
SELECT
  array_join(ngram, ' ') ngram,
  count(*) as counts,
  count(id) as n_ids
FROM ngram_array CROSS JOIN UNNEST (ngram_array) AS t(ngram)
GROUP BY ngram

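This seems to give me the ngrams, but the counts and n_ids columns have the same values. I expected them to differ, because the first is the count of each ngram across the whole sample and the second is the number of documents in which each ngram appears.

Do you know what I might be doing wrong? Also, is there a fiddle where I can test this online? (I know of fiddles for Postgres, but could not find one for Presto.)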

Answer 1

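You can split the text into an array of words, unnest it, and count the ids per group with the distinct option. Without distinct, count(id) counts every row whose id is not null, so it always matches count(*):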

-- sample data
WITH dataset (id, text) AS (
    VALUES (1,   'example text'),
        (2,   'this is an example text'),
        (3,   'text text text')
)

--query
SELECT word,
    count(*) counts,
    count(distinct id) n_ids -- count distinct ids
FROM (
        SELECT id,
            word
        FROM dataset
            CROSS JOIN UNNEST (split(text, ' ')) as t(word)
    )
GROUP BY word
ORDER BY counts desc -- order for output

Output:

word    | counts | n_ids
text    |  5     | 3
example |  2     | 2
this    |  1     | 1
is      |  1     | 1
an      |  1     | 1
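
If you would rather keep the ngrams-based shape of the original query, the same fix can be sketched as below. This is a sketch under assumptions, not a tested query: it assumes your Presto version provides ngrams(array, n) with the n-gram size as the second argument, and the names grams and gram are illustrative. The inline tbl only stands in for the real table.

```sql
-- Sketch only: the question's query with count(DISTINCT id) applied.
-- Sample rows stand in for tbl (same data as the question).
WITH tbl (id, text) AS (
    VALUES (1, 'example text'),
        (2, 'this is an example text'),
        (3, 'text text text')
),
ngram_array AS (
    -- Assumption: ngrams(array, n) takes the n-gram size; n = 1 gives unigrams
    SELECT id, ngrams(split(text, ' '), 1) AS grams FROM tbl
)
SELECT
    array_join(gram, ' ') AS ngram,
    count(*) AS counts,           -- one per occurrence of the ngram
    count(DISTINCT id) AS n_ids   -- each document counted once per ngram
FROM ngram_array
    CROSS JOIN UNNEST (grams) AS t(gram)
GROUP BY gram                     -- group on the unnested array, not the joined string alias
ORDER BY counts DESC
```

Changing the second argument of ngrams to 2 or 3 should give bigrams or trigrams with the same counting logic, which is the main reason to keep this form over a plain split.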
