Get rid of blank result from reg_ex_split_table output

Question

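I have a query that counts every word in a column and returns each word's frequency and frequency rank. For some reason, I keep getting a row with no word in it. How do I get rid of it?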

Table:

CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT);

Inserts:

INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo squared');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar');

Query:

select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
        rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from dummy d
    cross join lateral regexp_split_to_table(
        regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
        '\s+'
    ) w(word)
    group by created_day, word
) d
where created_day = CURRENT_DATE and word IS NOT NULL
order by rn
LIMIT 10;

Returns:

      created_day       |  word   | cnt | rn
------------------------+---------+-----+----
 2020-12-18 00:00:00+00 | foo     |   4 |  1
 2020-12-18 00:00:00+00 |         |   2 |  2
 2020-12-18 00:00:00+00 | arm     |   1 |  3
 2020-12-18 00:00:00+00 | squared |   1 |  3

I would like to get rid of the blank word:

      created_day       |  word   | cnt | rn
------------------------+---------+-----+----
 2020-12-18 00:00:00+00 | foo     |   4 |  1
 2020-12-18 00:00:00+00 | arm     |   1 |  2
 2020-12-18 00:00:00+00 | squared |   1 |  3

Answer 1

Can you use this in the where clause?

where created_day = CURRENT_DATE
  and word is not null -- this
order by rn;

Or you can apply the same condition here, before the group by:

) w(word)
where word is not null -- this
group by created_day, word

Answer 2

The problem is in the inner regexp_replace(): when the matched part is at the end of the string, you end up with a trailing space. Basically, when applied to 'foo bar', it produces 'foo '.

Then, when the string is split, this produces a final word whose value is the empty string ('').
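To see this in isolation, here is a minimal sketch that runs the two steps on the literal 'foo bar' instead of the table. The column aliases (cleaned, len, is_null) are only illustrative names, not part of the original query.

```sql
-- Step 1: removing 'bar' leaves the space that preceded it.
select regexp_replace('foo bar', '\y(rt|co|https|bar|none)\y', '', 'g') as cleaned;
-- cleaned is 'foo ' (with a trailing space)

-- Step 2: splitting 'foo ' on whitespace yields 'foo' plus an empty string.
-- The empty string has length 0 but is not NULL, so "word is not null"
-- does not filter it out.
select word,
       length(word)  as len,
       word is null  as is_null
from regexp_split_to_table('foo ', '\s+') as w(word);
```

The second query makes the difference between an empty string and NULL visible, which is why the blank row survives an "is not null" check.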

A simple fix is to trim() the output of regexp_replace(), so basically replace:

cross join lateral regexp_split_to_table(
    regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
    '\s+'
) w(word)

With:

cross join lateral regexp_split_to_table(
    trim(regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g')),
    '\s+'
) w(word)
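As an alternative sketch (my own variation, not part of the fix above), you could also filter out empty strings right after the split instead of trimming. This assumes the dummy table from the question; the only change to the original query is the "where w.word <> ''" line, which replaces the outer "word IS NOT NULL" check. It may be the more defensive option, since as far as I can tell a tweet consisting only of removed words would still produce an empty string even after trim().

```sql
select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
        rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from dummy d
    cross join lateral regexp_split_to_table(
        regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
        '\s+'
    ) w(word)
    where w.word <> ''   -- drop empty strings before counting and ranking
    group by created_day, word
) d
where created_day = CURRENT_DATE   -- same date filter as in the question
order by rn
limit 10;
```

Filtering inside the subquery, before the group by, means the blank row never takes part in rank(). Filtering in the outer where clause would hide the row but leave a gap in the rn values.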

Demo on DB Fiddle
