MySQL single column n-gram split and count
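Given a column of strings (passwords) in MySQL and a value N, I am looking for a SQL way to count the frequency of each n-gram (substring of length n).
It is important to keep the code inside MySQL, because in the other environments I have available it causes a memory overflow.
So far, the only workable approach I have found assumes that the strings have a limited length (a legitimate assumption): select the substring at each position separately, union the results, then group by and count, like this (9-grams out of 13 characters):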
SELECT
    nueve,
    COUNT(*) AS density,
    AVG(location) AS avgloc
FROM (
    SELECT MID(pass, 1, 9) AS nueve, 1 AS location
    FROM passdata
    WHERE LENGTH(pass) >= 9 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 2, 9), 2 AS location
    FROM passdata
    WHERE LENGTH(pass) >= 10 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 3, 9), 3 AS location
    FROM passdata
    WHERE LENGTH(pass) >= 11 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 4, 9), 4 AS location
    FROM passdata
    WHERE LENGTH(pass) >= 12 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 5, 9), 5 AS location
    FROM passdata
    WHERE LENGTH(pass) = 13
) AS nueves
GROUP BY nueve
ORDER BY density DESC
The results look like this:
nueve density avgloc
123456789 1387 2.4564
234567890 193 2.7306
987654321 141 2.0355
password1 111 1.7748
123123123 92 1.913
liverpool 89 1.618
111111111 86 2.2791
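where nueve is the 9-gram, density is the number of occurrences, and avgloc is the average starting position in the string.
Any suggestions for improving the query? I am doing the same for other n-grams as well.
Thanks!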
Create a table that contains all the numbers from 1 to the maximum password length. Then you can join against it to get the substring positions.
SELECT nueve, COUNT(*) AS density, AVG(location) AS avgloc
FROM (
    SELECT MID(p.pass, n.num, @N) AS nueve, n.num AS location
    FROM passdata AS p
    JOIN numbers_table AS n ON LENGTH(p.pass) >= (@N + n.num - 1)
) AS x
GROUP BY nueve
ORDER BY density DESC
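The query above relies on a helper table and a user variable that it does not define. A minimal sketch of that setup might look like the following. It assumes a maximum password length of 13 (as in the question) and reuses the names numbers_table, num, and @N from the query; the column type and the primary key are illustrative choices, not something the answer specifies.

```sql
-- Helper table holding every possible substring start position.
-- Assumption: passwords are at most 13 characters long, as in the question.
CREATE TABLE numbers_table (
    num INT UNSIGNED NOT NULL PRIMARY KEY
);

INSERT INTO numbers_table (num)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13);

-- n-gram length read by the query (9 matches the question's example).
SET @N = 9;
```

With this in place, changing @N should be enough to count other n-gram sizes without rewriting the query. One thing worth checking: LENGTH() counts bytes while MID() counts characters, so if the column can hold multibyte characters, CHAR_LENGTH() may be the safer choice in the join condition.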