How to find most popular word occurrences in MySQL?

Question

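I have a table called results with 5 columns.

I want to use the title column to find rows matching, say, WHERE title LIKE '%for sale%', and then list the most popular words in that column. One would be for and another would be sale, but I want to see which other words are associated with them.

Sample data: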

title
cheap cars for sale
house for sale
cats and dogs for sale
iphones and androids for sale
cheap phones for sale
house furniture for sale

Result (single words):

for    6
sale    6
cheap    2
and    2
house    2
furniture 1
cars    1
etc...

Answer 1

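Update

The idea is taken from (the link was not preserved).

This query works on my machine (MySQL 5.7), but SQL Fiddle reports an error. The basic idea is that you should create a table holding the numbers from 1 to the maximum number of words that can occur in your field (e.g. 4) or, as I did for simplicity, build them with a UNION of 1 .. 4.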

CREATE TABLE products (
  `id` int,
  `name` varchar(45)
);

INSERT INTO products
    (`id`, `name`)
VALUES
    (1, 'for sale'),
    (2, 'for me'),
    (3, 'for you'),
    (4, 'you and me')
;

SELECT name, COUNT(*) as count FROM
(
SELECT
  product.id,
  SUBSTRING_INDEX(SUBSTRING_INDEX(product.name, ' ', numbers.n), ' ', -1) name
FROM
  (
    SELECT 1 AS n
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
  ) AS numbers
  INNER JOIN products product
  ON CHAR_LENGTH(product.name)
     -CHAR_LENGTH(REPLACE(product.name, ' ', ''))>=numbers.n-1
ORDER BY
  id, n
)
AS result
GROUP BY name
ORDER BY count DESC

The result will be

for | 3
you | 2
me  | 2
and | 1
sale| 1
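
To make the nested SUBSTRING_INDEX trick easier to follow, here is a minimal Python sketch of the same logic. The helper substring_index is a hypothetical stand-in written for this illustration (it is not a MySQL or Python API), and max_words=4 mirrors the UNION 1 .. 4 subquery.

```python
from collections import Counter

def substring_index(s, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX (assumes a non-empty delim)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of the count-th delimiter from the end
    return ""

def count_words(names, max_words=4):
    numbers = range(1, max_words + 1)      # plays the role of the UNION 1 .. 4 subquery
    counts = Counter()
    for name in names:
        spaces = len(name) - len(name.replace(" ", ""))
        for n in numbers:
            if spaces >= n - 1:            # same condition as the INNER JOIN ... ON clause
                # keep the first n words, then take the last of them
                word = substring_index(substring_index(name, " ", n), " ", -1)
                counts[word] += 1
    return counts

names = ["for sale", "for me", "for you", "you and me"]
word_counts = count_words(names)
print(word_counts.most_common())
```

Each row is paired with as many numbers as it has words, which is why the numbers source has to reach at least the longest word count in the column.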

Answer 2

You can extract the words with some string manipulation. Assuming you have a numbers table and that the words are separated by single spaces:

select substring_index(substring_index(r.title, ' ', n.n), ' ', -1) as word,
       count(*)
from results r join
     numbers n
     on n.n <= length(title) - length(replace(title, ' ', '')) + 1
group by word;

If you don't have a numbers table, you can construct one manually with a subquery:

from results r join
     (select 1 as n union all select 2 union all select 3 union all . . .
     ) n
     . . .

A SQL Fiddle (courtesy of @GrzegorzAdamKowalski) is here.
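
The single-space assumption matters, and a small Python sketch can show why. The substring_index helper below is a hypothetical imitation of the MySQL function written for this illustration, and split_like_query is an assumed name for the per-row logic of the query above.

```python
def substring_index(s, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX (assumes a non-empty delim)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def split_like_query(title):
    # number of words the join condition allows: spaces + 1
    word_slots = len(title) - len(title.replace(" ", "")) + 1
    return [
        substring_index(substring_index(title, " ", n), " ", -1)
        for n in range(1, word_slots + 1)
    ]

print(split_like_query("house for sale"))    # ['house', 'for', 'sale']
print(split_like_query("house  for sale"))   # ['house', '', 'for', 'sale']
```

With a doubled space the query would count an empty string as a word, so titles may need normalizing first, or the empty value filtered out of the result.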

Answer 3

This gives you one word per row (if I understand what you mean by single word):

select concat(val,' ',cnt) as result from(
    select (substring_index(substring_index(t.title, ' ', n.n), ' ', -1)) val,count(*) as cnt
        from results t cross join(
         select a.n + b.n * 10 + 1 n
         from 
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) a,
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) b
                order by n 
        ) n
    where n.n <= 1 + (length(t.title) - length(replace(t.title, ' ', '')))
    group by val
    order by cnt desc
) as x

The result should look like this:

Result
--------
for 6
sale 6
house 2
and 2
cheap 2
phones 1
iphones 1
dogs 1
furniture 1
cars 1
androids 1
cats 1

But if by single word you mean something like this:

result
-----------
for 6 sale 6 house 2 and 2 cheap 2 phones 1 iphones 1 dogs 1 furniture 1 cars 1 androids 1 cats 1

just change the query above to:

select group_concat(concat(val,' ',cnt) separator ' ') as result from( ...

Answer 4

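SQL is not well suited to this task. It is possible, but only with limitations (e.g. on the number of words).

A quick PHP script that does the same job may be easier to live with in the long run (and is probably faster too).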

<?php
$rows = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
];

//rows here should be replaced by the SQL result
$wordTotals = [];
foreach ($rows as $row) {
    $words = explode(" ", $row);
    foreach ($words as $word) {
        if (isset($wordTotals[$word])) {
            $wordTotals[$word]++; 
            continue;
        }

        $wordTotals[$word] = 1;
    }
}

arsort($wordTotals);

foreach($wordTotals as $word => $count) {
    echo $word . " " . $count . PHP_EOL;
}

Output

for 6
sale 6
and 2
cheap 2
house 2
phones 1
androids 1
furniture 1
cats 1
cars 1
dogs 1
iphones 1
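
For comparison, the same application-side tally can be sketched in Python. This is only an illustration of the approach, not part of the original answer; rows is assumed to hold the titles returned by the SQL query.

```python
from collections import Counter

# rows here stand in for the titles returned by the SQL query
rows = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
]

word_totals = Counter()
for row in rows:
    # split on single spaces, like explode(" ", $row) in the PHP version
    word_totals.update(row.split(" "))

# most_common() sorts by count, highest first
for word, count in word_totals.most_common():
    print(word, count)
```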

Answer 5

Here is a working SQL Fiddle: http://sqlfiddle.com/#!9/0b0a0/32

Let's start with two tables, one for the text and one for numbers:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text
    (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

CREATE TABLE iterator (`index` int);

INSERT INTO iterator
    (`index`)
VALUES
    (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),
    (16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
;

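The second table, iterator, must contain the numbers from 1 to N, where N is at least the length of the longest string in text plus one: the end-of-word check runs against concat(t.title, ' '), so the last word of a title ends at position length + 1 (hence 30 rows for a varchar(29) column).

Then run this query: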

select
  words.word, count(*) as `count`
from 
(select
  substring(concat(' ', t.title, ' '), i.index+1, j.index-i.index) as word
from
  text as t, iterator as i, iterator as j
where
    substring(concat(' ', t.title), i.index, 1) = ' '
and substring(concat(t.title, ' '), j.index, 1) = ' '
and i.index < j.index
) AS words
where
    length(words.word) > 0
and words.word not like '% %'
group by words.word
order by `count` desc, words.word asc

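There are two selects. The outer one simply groups and counts the individual words (strings longer than 0 characters that contain no spaces). The inner one extracts every string that starts right after any space character and ends right before any later space character, so these strings are not necessarily words (even though the subquery is named words), because they can contain spaces other than the ones at the start and end.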
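
The two-level idea can be sketched in Python as follows. The function names and the max_index parameter (standing in for the size of the iterator table) are assumptions made for this illustration.

```python
from collections import Counter

def candidate_strings(title, max_index=30):
    left = " " + title            # concat(' ', t.title): a space at i marks a possible start
    right = title + " "           # concat(t.title, ' '): a space at j marks a possible end
    padded = " " + title + " "    # concat(' ', t.title, ' ')
    found = []
    for i in range(1, max_index + 1):           # iterator AS i, 1-based like SQL
        for j in range(i + 1, max_index + 1):   # iterator AS j with i.index < j.index
            if left[i - 1:i] == " " and right[j - 1:j] == " ":
                # SQL substring(padded, i + 1, j - i) is padded[i:j] in 0-based slicing
                found.append(padded[i:j])
    return found

def count_words(titles, max_index=30):
    counts = Counter()
    for title in titles:
        for s in candidate_strings(title, max_index):
            if len(s) > 0 and " " not in s:     # the outer WHERE clause
                counts[s] += 1
    return counts

print(candidate_strings("house for sale"))
# ['house', 'house for', 'house for sale', 'for', 'for sale', 'sale']
print(count_words(["house for sale", "cheap phones for sale"]))
```

The inner step deliberately over-generates multi-word strings, and the outer filter discards everything that still contains a space.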

Result:

word    count
for     6
sale    6
and     2
cheap   2
house   2
androids    1
cars    1
cats    1
dogs    1
furniture   1
iphones     1
phones  1

Answer 6

You can use ExtractValue in an interesting way. See the SQL Fiddle here: http://sqlfiddle.com/#!9/0b0a0/45

We only need one table:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

Now we construct a series of selects that extract whole words from the text converted to XML. Each select extracts the N-th word from the text.

select words.word, count(*) as `count` from
(select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[1]') as word from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[2]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[3]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[4]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[5]') from `text`) as words
where length(words.word) > 0
group by words.word
order by `count` desc, words.word asc
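
The XML trick can be imitated in Python to see what each select contributes. This is a rough sketch using the standard xml.etree.ElementTree module rather than MySQL's ExtractValue; the <r> wrapper element and the nth_word helper are assumptions added so the parser receives a single root, and titles containing characters such as & or < would need escaping first.

```python
import xml.etree.ElementTree as ET
from collections import Counter

titles = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
]

def nth_word(title, n):
    # same markup trick as the query: every space becomes a </w><w> boundary
    xml = "<r><w>" + title.replace(" ", "</w><w>") + "</w></r>"
    words = ET.fromstring(xml).findall("w")
    if n > len(words):
        return ""                  # no n-th word in this title
    return words[n - 1].text or ""

counts = Counter()
for n in range(1, 6):              # one pass per UNION ALL branch, w[1] .. w[5]
    for title in titles:
        word = nth_word(title, n)
        if len(word) > 0:          # WHERE length(words.word) > 0
            counts[word] += 1

for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(word, count)
```

As in the query, the number of passes caps the number of words per title that get counted, so five branches only cover titles of up to five words.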
