Trying to find exact word match within separate table field, accounting for negative words

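I have tried many different queries to solve this, but it has turned into a mess. Long story short, I am trying to find exact word matches (isolated words separated by spaces) based on 3 separate keywords, while excluding any match that contains a negative keyword.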

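field_name_1, field_name_2, and field_name_3 are all positive words. negative_keywords is a comma-separated set of words that are first split and then used to reject any result where ut.title contains a negative keyword.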

Essentially the query is asking: "Find where ut.title has either field_name_1, field_name_2, or field_name_3 but at the same time does not have a word from the split negative_keywords field."

Any help is greatly appreciated. Unfortunately, regex does not seem to be an option, because field_name_x is constant. Thanks in advance!

My current, over-bloated query is below:

SELECT ut.i_id as i_id, up.id AS p_id, up.option_id as option_id
    FROM ds_test.table_1 AS ut 
    CROSS JOIN 
(
SELECT field_name_1, field_name_2, field_name_3, SPLIT(negative_keywords ,",")  as negative_keywords, option_id, id
FROM ds_test.table_2 ) AS up 

    WHERE 
(
(ut.title contains " "+up.field_name_1+" ")  or 
(LEFT(ut.title, LENGTH(up.field_name_1+" ")) contains up.field_name_1+" ")  or
(RIGHT(ut.title, LENGTH(" "+up.field_name_1)) contains " "+up.field_name_1)  or
(ut.title contains " "+up.field_name_2+" ")  or 
(LEFT(ut.title, LENGTH(up.field_name_2+" ")) contains up.field_name_2+" ")  or
(RIGHT(ut.title, LENGTH(" "+up.field_name_2)) contains " "+up.field_name_2)  or
(ut.title contains " "+up.field_name_3+" ")  or 
(LEFT(ut.title, LENGTH(up.field_name_3+" ")) contains up.field_name_3+" ")  or
(RIGHT(ut.title, LENGTH(" "+up.field_name_3)) contains " "+up.field_name_3) or
(ut.title CONTAINS CONCAT(SUBSTR(up.field_name_1, 1 , LENGTH(up.field_name_1))," "))  or  
(ut.title CONTAINS CONCAT(SUBSTR(up.field_name_2, 1 , LENGTH(up.field_name_2))," "))  or  
(ut.title CONTAINS CONCAT(SUBSTR(up.field_name_3, 1 , LENGTH(up.field_name_3))," ")) 
and (NOT ut.title CONTAINS CONCAT(SUBSTR(up.negative_keywords, 1 , LENGTH(up.negative_keywords))," ")) 
)

GROUP EACH BY i_id, p_id, option_id

IGNORE CASE

For example:

In table ds_test.table_1, the title field contains "The X301-p and x301-b are Top of the charts"

In table ds_test.table_2, field_name_1, field_name_2, field_name_3, and negative_keywords are, respectively:

ROW 1 = |x301-f|x301p|x301-p|x301-a,x301-c|

ROW 2 = |x301-b|x301b|x301-d|x301-h,x301-p|

ROW 3 = |x301  |x30  |      |             |

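Row 1 would be true. It has x301-p, and none of the negative keywords are in the title.

Row 2 would be false. Although x301-b is in the title, x301-p is also there and it is a negative keyword.

Row 3 would be false. Although x301 and/or x30 appear in the title, they match only as substrings of X301-p or X301-b, so neither x301 nor x30 is a complete standalone word in the title.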
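
To make the Row 3 case concrete, here is a small Python sketch (not BigQuery) contrasting a plain substring test with a space-delimited whole-word test. The helper name `has_word` is made up for illustration, keywords are assumed to be single words, and the comparison is lowercased to mirror the IGNORE CASE in the query above.

```python
TITLE = "The X301-p and x301-b are Top of the charts"


def has_word(title, word):
    """Return True if `word` is one of the space-separated tokens of `title`.

    Hypothetical helper: case-insensitive, single-word keywords only.
    """
    if not word:
        return False  # an empty keyword never matches
    return word.lower() in title.lower().split(" ")


# A substring test is fooled by "x301" sitting inside "X301-p".
substring_hit = "x301" in TITLE.lower()
# A whole-word test is not, because no token equals "x301".
whole_word_hit = has_word(TITLE, "x301")

print(substring_hit, whole_word_hit)  # True False
```

The same helper accepts `x301-p` and `x301-b`, since both appear in the title as complete tokens.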

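The idea is:

  • Split the negative keywords into a repeated field
  • Remove negative words using the OMIT RECORD IF SOME(title CONTAINS negative) construct
  • Use CONTAINS to match full words with spaces around them, or custom patterns with LIKE to catch the beginning/end of the string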
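
As a rough illustration of these three steps outside BigQuery, here is a Python sketch run against the example rows from the question. The function name `matches` is hypothetical; lowercasing both sides and skipping empty fields are assumptions of this sketch rather than something the SQL does for you.

```python
def matches(title, fields, negative_csv):
    """Sketch of the three steps: split negatives, omit on negatives, whole-word match."""
    t = title.lower()

    # Step 1: split the comma-separated negatives into a list (the "repeated field").
    negatives = [n for n in negative_csv.lower().split(",") if n]

    # Step 2: drop the record if any negative occurs in the title.
    # This is a plain substring test, like title CONTAINS negative.
    if any(n in t for n in negatives):
        return False

    # Step 3: whole-word match in the middle, at the end, or at the start.
    for f in fields:
        f = f.lower()
        if not f:
            continue  # assumption: empty fields are ignored
        if f" {f} " in t or t.endswith(f" {f}") or t.startswith(f"{f} "):
            return True
    return False


title = "The X301-p and x301-b are Top of the charts"
rows = [
    (["x301-f", "x301p", "x301-p"], "x301-a,x301-c"),
    (["x301-b", "x301b", "x301-d"], "x301-h,x301-p"),
    (["x301", "x30", ""], ""),
]
results = [matches(title, fields, neg) for fields, neg in rows]
print(results)  # [True, False, False]
```

This agrees with the expected outcome in the question: only Row 1 survives.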

Putting it together using the data from your example:

SELECT title, field_1, field_2, field_3 FROM (
SELECT title, field_1, field_2, field_3, SPLIT(table2.negative) negative FROM
(SELECT * FROM 
 (SELECT 'The x301-b tops the x301-p' title),
 (SELECT 'The X301-p and x301-b are Top of the charts' title)) table1
CROSS JOIN
(SELECT * FROM
(SELECT 'x301-f' field_1, 'x301p' field_2, 'x301-p' field_3, 'x301-a,x301-c' negative),
(SELECT 'x301-b' field_1, 'x301b' field_2, 'x301-d' field_3, 'x301-h,x301-p' negative),
(SELECT 'x301'   field_1, 'x30'   field_2, '' field_3, '' negative)) table2
)
WHERE title CONTAINS ' ' + field_1 + ' ' OR title LIKE '% ' + field_1 OR title LIKE field_1 + ' %' OR
      title CONTAINS ' ' + field_2 + ' ' OR title LIKE '% ' + field_2 OR title LIKE field_2 + ' %' OR
      title CONTAINS ' ' + field_3 + ' ' OR title LIKE '% ' + field_3 OR title LIKE field_3 + ' %'
OMIT RECORD IF SOME(title CONTAINS negative)

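Update: Since evaluating LIKE appears to be too expensive on the real dataset, an alternative is to pad the title with a space on both sides before doing the CONTAINS check. The modified query is: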

SELECT title, field_1, field_2, field_3 FROM (
SELECT title, field_1, field_2, field_3, SPLIT(table2.negative) negative FROM
(SELECT ' ' + title + ' ' AS title FROM 
 (SELECT 'The x301-b tops the x301-p' title),
 (SELECT 'The X301-p and x301-b are Top of the charts' title)) table1
CROSS JOIN
(SELECT * FROM
(SELECT 'x301-f' field_1, 'x301p' field_2, 'x301-p' field_3, 'x301-a,x301-c' negative),
(SELECT 'x301-b' field_1, 'x301b' field_2, 'x301-d' field_3, 'x301-h,x301-p' negative),
(SELECT 'x301'   field_1, 'x30'   field_2, '' field_3, '' negative)) table2
)
WHERE title CONTAINS ' ' + field_1 + ' ' OR
      title CONTAINS ' ' + field_2 + ' ' OR
      title CONTAINS ' ' + field_3 + ' '
OMIT RECORD IF SOME(title CONTAINS negative)
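
To see why the padding makes the separate beginning/end patterns unnecessary, here is a small Python sketch of the positive match only (the negative-keyword step is left out). The name `padded_match` is made up for illustration, and the lowercasing is an assumption standing in for a case-insensitive comparison.

```python
def padded_match(title, fields):
    """Whole-word test using a single 'contains' check on a space-padded title."""
    padded = " " + title.lower() + " "
    for f in fields:
        if f and f" {f.lower()} " in padded:
            return True
    return False


checks = [
    padded_match("x301-b comes first", ["x301-b"]),   # keyword at the start
    padded_match("the winner is x301-b", ["x301-b"]), # keyword at the end
    padded_match("x301-b", ["x301-b"]),               # title is just the keyword
    padded_match("The X301-p and x301-b", ["x301", "x30", ""]),  # substrings only
]
print(checks)  # [True, True, True, False]
```

A side effect of padding is that a title consisting of nothing but the keyword now matches as well, which the three-pattern version does not cover.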