Word popularity leaderboard in SQL Server based message-board

Question

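In a SQL Server database, I have a table Messages with the following columns:

Id INT(1,1)
Detail VARCHAR(5000)
DatetimeEntered DATETIME
PersonEntered VARCHAR(25)

The messages are very basic and only allow alphanumeric characters and a handful of special characters, which are as follows: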

`¬!"£$%^&*()-_=+[{]};:'@#~\|,<.>/?

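Ignoring most of the special characters apart from the apostrophe, I need a way to list each word along with the number of times it occurs in the Detail column, which I can then filter by PersonEntered and DatetimeEntered.

Example output: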

Word    Frequency
-----------------
a       11280
the     10102
and      8845
when     2024
don't    2013
.
.
.

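It doesn't need to be particularly clever. It is perfectly fine if dont and don't are treated as separate words.

I'm having trouble splitting out the words into a temporary table called #Words.

Once I have the temporary table, I would apply the following query: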

-- COUNT(*) counts the rows per word; SUM cannot be applied to a character column.
SELECT 
    Word, 
    COUNT(*) AS WordCount 
FROM #Words 
GROUP BY Word 
ORDER BY COUNT(*) DESC

Please help.

Answer 1

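Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only ' is going to appear inside a word; anything else is going to be punctuation.

You haven't posted what version of SQL Server you're using, so I'm going to use SQL Server 2017 syntax. If you don't have the latest version, you'll need to replace TRANSLATE with nested REPLACE calls (so REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, '¬',' '),...),'/',' '),'?',' ')). On versions older than SQL Server 2016, which introduced STRING_SPLIT, you'll also need to find a string splitter (for example, Jeff Moden's DelimitedSplit8K, http://www.sqlservercentral.com/articles/Tally+Table/72993/).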
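As a rough illustration of the nested REPLACE fallback, the sketch below strips only five of the permitted characters from a made-up variable, @Detail. The variable and its sample text are assumptions for demonstration; the remaining characters would have to be added to the chain in the same way.

```sql
-- Sketch for versions without TRANSLATE.
-- @Detail and its contents are hypothetical sample data.
DECLARE @Detail varchar(5000) = 'Don''t panic: it works (mostly), doesn''t it?';

-- Each REPLACE swaps one punctuation character for a space.
-- Only : ( ) , ? are handled here; extend the chain for the other characters.
-- The apostrophe is deliberately left alone so that don't stays one word.
SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
           @Detail, ':', ' '), '(', ' '), ')', ' '), ',', ' '), '?', ' ') AS StrippedDetail;
```

The stripped string can then be passed to whichever splitter is available on your version.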

USE Sandbox;
GO
CREATE TABLE [Messages] (Detail varchar(5000));

INSERT INTO [Messages]
VALUES ('Personally, I would strip out almost all the special characters, and then use a splitter on the space character. Of your permitted characters, only `''` is going to appear in a word; anything else is going to be grammatical. You haven''t posted what version of SQL you''re using, so I''ve going to use SQL Server 2017 syntax. If you don''t have the latest version, you''ll need to replace `TRANSLATE` with a nested `REPLACE` (So `REPLACE(REPLACE(REPLACE(REPLACE(... REPLACE(M.Detail, ''¬'','' ''),...),''/'','' ''),''?'','' '')`, and find a string splitter (for example, Jeff Moden''s [DelimitedSplit8K](http://www.sqlservercentral.com/articles/Tally+Table/72993/)).'),
       ('As a note, this is going to perform **AWFULLY**. SQL Server is not designed for this type of work. I also imagine you''ll get some odd results and it''ll include numbers in there. Things like dates are going to get split out,, numbers like `9,000,000` would be treated as the words `9` and `000`, and hyperlinks will be separated.')
GO
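-- Replace every permitted special character except the apostrophe with a space.
-- TRANSLATE requires both character lists to be the same length (33 here).
WITH Replacements AS(
    SELECT TRANSLATE(M.Detail, '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?', REPLICATE(' ', 33)) AS StrippedDetail
    FROM [Messages] M)
-- Split on spaces, discard the empty strings left by consecutive spaces, then count.
SELECT SS.[value] AS Word, COUNT(*) AS WordCount
FROM Replacements R
     CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;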
GO
DROP TABLE [Messages];
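The question also asks for filtering by PersonEntered and DatetimeEntered, which the sample table above does not include. A minimal sketch of one way to do it, assuming SQL Server 2017 and the column names from the question, is to filter inside the CTE before the text is split. The temporary table #Messages, its rows, and the variables @Person, @From and @To are all made up for illustration.

```sql
-- Hypothetical sample data using the column names from the question.
CREATE TABLE #Messages (
    Id int IDENTITY(1,1),
    Detail varchar(5000),
    DatetimeEntered datetime,
    PersonEntered varchar(25)
);

INSERT INTO #Messages (Detail, DatetimeEntered, PersonEntered)
VALUES ('Don''t stop, don''t wait!', '20180105 09:00', 'alice'),
       ('Wait... what?',             '20180210 14:30', 'alice'),
       ('Don''t ask me.',            '20180106 11:15', 'bob');

-- Assumed filter values; in practice these would be parameters.
DECLARE @Person varchar(25) = 'alice',
        @From   datetime    = '20180101',
        @To     datetime    = '20180201';

WITH Replacements AS (
    SELECT TRANSLATE(M.Detail, '`¬!"£$%^&*()-_=+[{]};:@#~\|,<.>/?', REPLICATE(' ', 33)) AS StrippedDetail
    FROM #Messages M
    -- Filter first so that only the matching messages are stripped and split.
    WHERE M.PersonEntered = @Person
      AND M.DatetimeEntered >= @From
      AND M.DatetimeEntered <  @To
)
SELECT SS.[value] AS Word, COUNT(*) AS WordCount
FROM Replacements R
     CROSS APPLY STRING_SPLIT(R.StrippedDetail, ' ') SS
WHERE LEN(SS.[value]) > 0
GROUP BY SS.[value]
ORDER BY WordCount DESC;

DROP TABLE #Messages;
```

Filtering before the split keeps the expensive string work to the rows that are actually needed. Whether Don't and don't are grouped together depends on the collation of the database.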

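As a note, this is going to perform AWFULLY. SQL Server is not designed for this type of work. I also imagine you'll get some odd results, and it'll include numbers in there. Things like dates are going to get split out, numbers like 9,000,000 would be treated as the words 9 and 000, and hyperlinks will be separated.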
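If the numeric tokens are unwanted, one possible refinement is to drop any token made up only of digits. This is a sketch, not part of the solution above: the variable @Detail and its sample text are assumptions, and STRING_SPLIT requires SQL Server 2016 or later.

```sql
-- Hypothetical sample text that has already had its punctuation replaced by spaces.
DECLARE @Detail varchar(5000) = 'It cost 9 000 000 in 2017 but don''t quote me on 3rd party numbers';

SELECT SS.[value] AS Word, COUNT(*) AS WordCount
FROM STRING_SPLIT(@Detail, ' ') SS
WHERE LEN(SS.[value]) > 0
  -- Keep only tokens that contain at least one non-digit character,
  -- so 9, 000 and 2017 are excluded while 3rd is kept.
  AND SS.[value] LIKE '%[^0-9]%'
GROUP BY SS.[value]
ORDER BY WordCount DESC;
```

This does nothing for split dates or hyperlinks that contain letters; those fragments will still be counted as words.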

sql-server

split

group-by

sum

alphanumeric