删除禁用词然后折叠数据

Question

获取 'exact' 与一组博客评论相匹配的禁用词。然后创建一个结果集，将禁用词折叠到所有者（博客评论）中，显示找到的禁用词和计数。

关于如何以更有效的方式执行此操作的任何建议 - 更少的 DML - 因为可能会有成千上万的评论？

Banned words:
though
man
about
hear

Blog comments:
'There are many of us.'
'The man.'
'So glad to hear about.' 
'So glad to hear about. A regular guy.'
'though, though, though.' 

1st entry: word is NOT banned - it's a variant of a banned word. Entry is NOT to be selected.
2nd entry: 1 banned word. Entry selected as 1 row, 1 banned word, counted as 1 banned word.
3rd entry: 2 different banned words. Entry selected as 1 row, 2 banned words separate by 
           commas, counted as 2 banned words.
4th entry: 2 different banned words. Entry selected as 1 row, 2 banned words separate by
           commas, counted as 2 banned words.
5th entry: 3 same banned words. Entry selected as 1 row, 1 banned word, counted as 3 banned 
           word.

Rules:
- Get the banned words in the blog comment.
- Only EXACT matches to the banned words. Do NOT include variants of the banned word.
- If there are more than 1 banned words in the same blog comment, only 1 row should be
  generated. 
- Generate the owner's row, include the banned words BannedWords column - non-unique banned 
  words separated by comma. 1 word for unique banned words.
  Count the banned words and include that column in the generated row. 

Desired Result - 4 rows:
BlogCommentId  BannedWords  t_Text1  t_Text2  t_Text3  CntOfBannedWords
2              man          e        f        g        1
3              hear,about   h        i        j        2
4              hear,about   k        l        m        2
5              though       n        o        p        3

具体禁词匹配码：

DECLARE @tableFinal TABLE (
        t0_BlogCommentId int,
        t0_Word VARCHAR(50),
        t0_Text1 varchar(10),
        t0_Text2 varchar(10),
        t0_Text3 varchar(10),
        t0_CntOfBannedWords int)

DECLARE @table1 TABLE (
        t_BlogCommentId int,
        t_Word VARCHAR(50),
        t_Text1 varchar(10),
        t_Text2 varchar(10),
        t_Text3 varchar(10));

DECLARE @BlogComment TABLE (
        BlogCommentId INT IDENTITY PRIMARY KEY,
        BlogCommentContent VARCHAR(MAX),
        Text1 varchar(10),
        Text2 varchar(10),
        Text3 varchar(10));
INSERT INTO @BlogComment 
           (BlogCommentContent 
            ,Text1                    
            ,Text2                    
            ,Text3)
VALUES
('There are many of us.',                        'a', 'b', 'c')
('The man.',                                     'e', 'f', 'g')
('So glad to hear about.',                       'h', 'i', 'j')
('So glad to hear about. A regular guy.',        'k', 'l', 'm')
('though, though, though.',                      'n', 'o', 'p');

DECLARE @BannedWords TABLE (
        BannedWordsId INT IDENTITY PRIMARY KEY,
        Word varchar(250));
INSERT INTO @BannedWords (Word) VALUES
('though'),
('man'),
('about'),
('hear');

;WITH rs AS
(
    SELECT word = REPLACE(REPLACE([value],'.',''),',','')
                  ,BlogCommentId
                  ,Text1
                  ,Text2
                  ,Text3
    FROM @BlogComment 
    CROSS APPLY STRING_SPLIT(BlogCommentContent, SPACE(1))
)
INSERT @table1
    (t_Word,
     t_BlogCommentId,
     t_Text1,
     t_Text2,
     t_Text3 )  
SELECT bw.Word
       ,rs.BlogCommentId
       ,rs.Text1
       ,rs.Text2
       ,rs.Text3
FROM rs
INNER JOIN @BannedWords bw ON rs.word = bw.Word;

折叠前上面 WITH 的结果。如果可能的话，我希望在 WITH 中在这里生成 'Desired Result'，而不必在它下面添加额外的代码。

SELECT *
FROM @table1

Results:
t_BlogCommentId  t_BannedWords  t_Text1  t_Text2  t_Text3
2                man            e        f        g
3                about          h        i        j
3                hear           h        i        j
4                about          k        l        m
4                hear           k        l        m
5                though         n        o        p
5                though         n        o        p
5                though         n        o        p

-- 'additional code to collapse':

INSERT @tableFinal
      (t0_BlogCommentId
       ,t0_Word
       ,t0_Text1
       ,t0_Text2
       ,t0_Text3
       ,t0_CntOfBannedWords )
SELECT DISTINCT t_BlogCommentId
                ,''
                ,''
                ,''
                ,''
                ,0
FROM @table1

UPDATE @tableFinal
SET t0_Word = t_Word
    ,t0_Text1 = t_Text1
    ,t0_Text2 = t_Text2
    ,t0_Text3 = t_Text3
FROM @table1
WHERE t0_BlogCommentId = t_BlogCommentId

UPDATE @tableFinal
SET t0_Word = t0_Word + ',' + t_Word
FROM @table1
WHERE t0_BlogCommentId = t_BlogCommentId AND t0_Word <> t_Word

UPDATE @tableFinal
SET t0_CntOfBannedWords = (SELECT Count (t_Word)
                           FROM @table1
                           WHERE t0_BlogCommentId = t_BlogCommentId)

崩溃的结果 - 现在是我的 'Desired Result' - 但更多的工作，如果有成千上万的评论，可能不适合：

SELECT t0_BlogCommentId as BlogCommentId
       ,t0_Word as BannedWords
       ,t0_Text1 as Text1
       ,t0_Text2 as Text2
       ,t0_Text3 as Text3
       ,t0_CntOfBannedWords as CntOfBannedWords
FROM @tableFinal

BlogCommentId  BannedWords  t_Text1  t_Text2  t_Text3  CntOfBannedWords
2              man          e        f        g        1
3              hear,about   h        i        j        2
4              hear,about   k        l        m        2
5              though       n        o        p        3

https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=982989411c1a3e3fb784f1e0e46fd9e1

Answer 1

优秀post。感谢您以易于使用的格式提供所有数据。你真的很接近把这一切放在一起。您只需要最后一块来将这些值重新组合在一起。在这里，我为此使用了 STUFF。我刚开始使用您的“rs”cte。我添加了另一个 cte，然后使用了旧的 STUFF 技巧。这会为您的示例数据生成所需的输出。

DECLARE @BlogComment TABLE (
        BlogCommentId INT IDENTITY PRIMARY KEY,
        BlogCommentContent VARCHAR(MAX),
        Text1 varchar(10),
        Text2 varchar(10),
        Text3 varchar(10));
INSERT INTO @BlogComment 
           (BlogCommentContent 
            ,Text1                    
            ,Text2                    
            ,Text3)
VALUES
('There are many of us.',                        'a', 'b', 'c')
, ('The man.',                                     'e', 'f', 'g')
, ('So glad to hear about.',                       'h', 'i', 'j')
, ('So glad to hear about. A regular guy.',        'k', 'l', 'm')
, ('though, though, though.',                      'n', 'o', 'p');

DECLARE @BannedWords TABLE (
        BannedWordsId INT IDENTITY PRIMARY KEY,
        Word varchar(250));
INSERT INTO @BannedWords (Word) VALUES
('though'),
('man'),
('about'),
('hear');

;WITH rs AS
(
    SELECT word = REPLACE(REPLACE([value],'.',''),',','')
                  ,BlogCommentId
                  ,Text1
                  ,Text2
                  ,Text3
    FROM @BlogComment 
    CROSS APPLY STRING_SPLIT(BlogCommentContent, SPACE(1))
)
, ExpandedWords as
(
    SELECT bw.Word
       ,rs.BlogCommentId
       ,rs.Text1
       ,rs.Text2
       ,rs.Text3
    FROM rs
    INNER JOIN @BannedWords bw ON rs.word = bw.Word
)

select BlogCommentId
    , BannedWords = STUFF((select ', ' + e2.Word
                            from ExpandedWords e2
                            where e2.BlogCommentId = e1.BlogCommentId
                            --you could add an order by here if you want the list of words in a certain order.
                            FOR XML PATH('')), 1, 1, ' ')
    , e1.Text1
    , e1.Text2
    , e1.Text3
    , BannedWordsCount = count(*)
from ExpandedWords e1
group by e1.BlogCommentId
    , e1.Text1
    , e1.Text2
    , e1.Text3
order by e1.BlogCommentId

对于更现代的 sql 服务器版本，使用 STRING_AGG 比使用 STUFF 和 FOR XML 更简洁和更迟钝。这是它的样子。我在这里也使用 ROW_NUMBER，这样如果一个禁用词在输入中出现多次，您只能 return 一个禁用词实例。与上面的概念相同，所以这在代码中稍后开始。

, ExpandedWords as
(
    SELECT bw.Word
       , rs.BlogCommentId
       , rs.Text1
       , rs.Text2
       , rs.Text3
       , RowNum = ROW_NUMBER()over(partition by rs.BlogCommentId, bw.Word order by (select newid())) --order doesn't really matter here
    FROM rs
    INNER JOIN @BannedWords bw ON rs.word = bw.Word
)

select e.BlogCommentId
    , STRING_AGG(e.Word, ', ')
    , e.Text1
    , e.Text2
    , e.Text3
    , RowNum
from ExpandedWords e
where RowNum = 1
group by e.BlogCommentId
    , e.Text1
    , e.Text2
    , e.Text3
    , RowNum
order by e.BlogCommentId

删除禁用词然后折叠数据

Remove banned words then collapse the data

sql

sql-server

match