String matching in URL using Hive / Spark SQL
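I have two tables, one containing a list of URLs and the other containing a list of words. My requirement is to filter out the URLs that contain any of the words.
For example: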
URL
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://www.t3.com/news/cheap-oled-tv-deals-currys-august
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
https://www.sony.co.in/electronics/televisions/a1-series
https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW
https://www.91mobiles.com/list-of-tvs/sony-oled-tv
Words
Sony
Samsung
Deal
Bravia
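Now I want to filter out every URL that contains any of the words. Normally I would write something like
Select url from url_table where url not like '%Sony%' and url not like '%Samsung%' and url not like '%Deal%' and url not like '%Bravia%';
But this is cumbersome and does not scale. What is the best way to achieve this? How do I use the not like function against the words table?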
Use a regular expression:
where url not rlike '(?i)Sony|Samsung|Deal|Bravia'
(?i) makes the match case-insensitive.
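As a quick check of what the flag changes, the two expressions below test one of the sample URLs against the same word with and without (?i). This is only an illustrative sketch; the column aliases are made up for the example.

```sql
-- rlike succeeds when the pattern is found anywhere in the string.
select
  'https://www.sony.co.in/electronics/televisions/a1-series' rlike 'Sony'     as case_sensitive,   -- false: the URL has lowercase "sony"
  'https://www.sony.co.in/electronics/televisions/a1-series' rlike '(?i)Sony' as case_insensitive; -- true: case is ignored
```

Because the match is unanchored, a word such as Deal also matches inside "deals".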
Now let's build the same regular expression from the table of words.
You can aggregate the list of words from the table and pass it to rlike. See this example:
with
initial_data as ( -- replace with your URL table
  select stack(7,
    'https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html',
    'https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop',
    'https://www.t3.com/news/cheap-oled-tv-deals-currys-august',
    'https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/',
    'https://www.sony.co.in/electronics/televisions/a1-series',
    'https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW',
    'https://www.91mobiles.com/list-of-tvs/sony-oled-tv'
  ) as url
),
words as ( -- replace with your words table
  select stack(4, 'Sony', 'Samsung', 'Deal', 'Bravia') as word
),
sub as ( -- aggregate the words into one pattern for rlike
  -- do not wrap the pattern in quote characters: they would become literal characters of the regex
  select concat('(?i)', concat_ws('|', collect_set(word))) as words_regex from words
)
select s.url
from initial_data s
cross join sub -- one-row cross join attaches words_regex to every URL
where s.url not rlike sub.words_regex;
Result:
OK
url
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
Time taken: 10.145 seconds, Fetched: 3 row(s)
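One caveat with a pattern assembled from table data: if a word contains regex metacharacters such as . or +, the pattern may match more than the literal word. The following is an alternative sketch, not part of the answer above, that compares plain substrings instead. It assumes a table url_table(url) as in the question and a table words(word); both names are placeholders.

```sql
-- Pair every URL with every word, then keep URLs for which no word was found.
select u.url
from url_table u
cross join words w
group by u.url
having max(case when instr(lower(u.url), lower(w.word)) > 0 then 1 else 0 end) = 0;
```

On the sample data this should keep the same three URLs. Note that the intermediate row count is URLs times words, duplicate URLs collapse into one row because of the group by, and an empty words table yields no rows at all.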
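You can also compute the subquery separately and pass its result in as a variable instead of using the cross join from my example. I hope you get the idea.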
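A minimal sketch of the variable approach in Hive, assuming the same placeholder tables url_table(url) and words(word), and a variable name words_regex chosen for illustration:

```sql
-- Step 1: build the pattern once and note the single value it returns.
select concat('(?i)', concat_ws('|', collect_set(word))) as words_regex from words;

-- Step 2: store that value in a Hive variable. The value shown is what
-- step 1 yields for the sample words; the order of the words may differ.
set hivevar:words_regex=(?i)Sony|Samsung|Deal|Bravia;

-- Step 3: the variable is substituted as text before the query is compiled.
select url
from url_table
where url not rlike '${hivevar:words_regex}';
```

When the query is launched from a script, the same variable can be supplied on the command line with the --hivevar option instead of the set statement. Since substitution is purely textual, a word containing a single quote would break the query in step 3.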