Get rid of blank result from regexp_split_to_table output
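I have a query that counts every word in a column and returns each word's frequency and frequency rank. For some reason, I keep getting a row with no word in it. How do I get rid of it?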
Table:
CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT);
Inserts:
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo squared');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar');
Query:
select *
from (
select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
from dummy d
cross join lateral regexp_split_to_table(
regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
'\s+'
) w(word)
group by created_day, word
) d
where created_day = CURRENT_DATE and word IS NOT NULL
order by rn
LIMIT 10;
Returns:
created_day | word | cnt | rn
------------------------+---------+-----+----
2020-12-18 00:00:00+00 | foo | 4 | 1
2020-12-18 00:00:00+00 | | 2 | 2
2020-12-18 00:00:00+00 | arm | 1 | 3
2020-12-18 00:00:00+00 | squared | 1 | 3
I want to get rid of the blank word:
created_day | word | cnt | rn
------------------------+---------+-----+----
2020-12-18 00:00:00+00 | foo | 4 | 1
2020-12-18 00:00:00+00 | arm | 1 | 2
2020-12-18 00:00:00+00 | squared | 1 | 3
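You can filter it out in the where clause. The blank word is an empty string rather than NULL, so compare against '':
where created_day = CURRENT_DATE
and word <> '' -- this
order by rn;
Or you can apply the same condition inside the subquery:
) w(word)
where word <> '' -- this
group by created_day, word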
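The problem is with the inner regexp_replace(): when the matched part is at the end of the string, you end up with a trailing space. Basically, when applied to 'foo bar', it produces 'foo '.
Then, when the string is split, this produces a final word whose value is the empty string ('').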
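To see the two steps in isolation, the following minimal sketch runs each function on a literal instead of the dummy table. The square brackets are only there to make whitespace visible; they are an illustration aid, not part of the original query.

-- Step 1: removing the trailing word leaves behind the space that preceded it.
select '[' || regexp_replace('foo bar', '\y(rt|co|https|bar|none)\y', '', 'g') || ']' as replaced;

-- Step 2: splitting that result on whitespace yields a final empty-string word.
select '[' || word || ']' as word
from regexp_split_to_table('foo ', '\s+') as w(word);

The first statement should show the trailing space inside the brackets, and the second should return two rows, the last of which has nothing between the brackets.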
A simple workaround is to trim() the output of regexp_replace(), so basically replace:
cross join lateral regexp_split_to_table(
regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
'\s+'
) w(word)
With:
cross join lateral regexp_split_to_table(
trim(regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g')),
'\s+'
) w(word)
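For reference, here is a sketch of the whole query with trim() applied. It also adds an empty-string filter before the grouping, which is an assumption on my part rather than something the answer requires: trim() with no extra arguments only strips spaces, and a tweet made up entirely of removed words would most likely still split into a single empty word, so the filter is there as a safety net.

-- Original query with trim() applied to the regexp_replace() output.
select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
           rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from dummy d
    cross join lateral regexp_split_to_table(
        trim(regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g')),
        '\s+'
    ) w(word)
    where word <> ''  -- optional addition: drop any empty word before it is counted and ranked
    group by created_day, word
) d
where created_day = CURRENT_DATE
order by rn
limit 10;

Filtering inside the subquery rather than in the outer where clause means the blank word never takes up a rank, so the remaining ranks stay contiguous.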