How to optimize MySQL queries with joins and subqueries for large datasets (millions of rows)
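I am trying to join four large tables (35-200 million rows) of an international patent database (PATSTAT) to get a top 15 of the most cited patents that meet some requirements.
The first table, docdb_family_citation (t9), lists citations from one set (family) of applications to another set of applications.
Another table, tls201_appln (t1), basically links everything together: it holds both the family and application IDs as well as the application filing year.
Tables tls204_appln_prior (t2) and tls209_appln_ipc are used to identify which appln_id values to include.
The code I came up with is as follows: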
SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9
LEFT JOIN
    (SELECT t1.appln_id, t1.docdb_family_id
     FROM tls201_appln t1
     LEFT JOIN tls204_appln_prior t2 ON t1.appln_id = t2.appln_id
     WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
       AND t2.appln_id IS NULL
       AND t1.appln_id IN (SELECT DISTINCT appln_id
                           FROM tls209_appln_ipc
                           WHERE ipc_subclass_symbol IN ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J"))
    ) t3 ON t9.cited_docdb_family_id = t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15
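The problem is that the query, run in PATSTAT's web-based online interface, does not finish before my session times out. Is there a way to make this query more efficient?
-EDIT-
tls209_appln_ipc contains 195 million rows of appln_id and ipc_subclass_symbol. An appln_id may occur zero or more times in this table. For my query, I only need a docdb_family_id if any of its linked appln_ids is linked to any of the ipc_subclass_symbol values I listed.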
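I assume you have already created the required indexes, so I will skip the indexing part.
- One option is to use views for your subqueries or main query and update them in the background. MySQL views are not materialized, so in practice this means a summary table that a background job refreshes. This may help with the timeout, because you would select from the precomputed result while the background process runs your slow query.
- Another option is range partitioning on appln_filing_year and maybe list partitioning on ipc_subclass_symbol. The year will not be a problem, but I do not know how many distinct values ipc_subclass_symbol has, so check the partitioning limitations first. In your case, partitioning should return results a bit faster than an unpartitioned table.
- You can increase wait_timeout in my.cnf or at runtime in mysql. Unless you have changed it, the default is 28800 seconds. Personally, I do not like this option.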
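The first option above could look roughly like the sketch below. It assumes a local copy of PATSTAT where you are allowed to create tables; the table name eligible_appln, the index name, and the INT column types are made up for illustration and are not part of PATSTAT. The IPC list follows the question, reading the subclasses as C12N and C12P.

```sql
-- Hypothetical summary table: one row per application that passes all filters.
CREATE TABLE eligible_appln (
    appln_id        INT NOT NULL,
    docdb_family_id INT NOT NULL,
    KEY idx_eligible_family (docdb_family_id)
);

-- The slow part, meant to run in a background job instead of the web session.
INSERT INTO eligible_appln (appln_id, docdb_family_id)
SELECT t1.appln_id, t1.docdb_family_id
FROM tls201_appln t1
WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
  AND EXISTS (SELECT 1
              FROM tls209_appln_ipc i
              WHERE i.appln_id = t1.appln_id
                AND i.ipc_subclass_symbol IN ('A61K', 'C07K', 'A61P', 'C12N', 'C07D',
                                              'C12P', 'C07H', 'C12Q', 'C07J'))
  AND NOT EXISTS (SELECT 1
                  FROM tls204_appln_prior p
                  WHERE p.appln_id = t1.appln_id);
```

The top-15 query can then join docdb_family_citation against eligible_appln instead of repeating the filters on every run.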
Hope this helps.
I would be tempted to first remove the inner subquery. It can be done as a JOIN within the main subquery, using DISTINCT to remove the duplicates that the join would otherwise create:
SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9
LEFT JOIN
(
    SELECT DISTINCT t1.appln_id, t1.docdb_family_id
    FROM tls201_appln t1
    INNER JOIN tls209_appln_ipc t99 ON t1.appln_id = t99.appln_id
    LEFT JOIN tls204_appln_prior t2 ON t1.appln_id = t2.appln_id
    WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
    AND t2.appln_id IS NULL
    AND t99.ipc_subclass_symbol IN ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J")
) t3
ON t9.cited_docdb_family_id = t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15
If duplicate combinations of t1.appln_id and t1.docdb_family_id can occur on multiple rows of the tls201_appln table, then I suggest also returning the unique key of those rows (so that DISTINCT returns distinct rows rather than distinct values).
This is your query:
SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9 LEFT JOIN
     (SELECT t1.appln_id, t1.docdb_family_id
      FROM tls201_appln t1 LEFT JOIN
           tls204_appln_prior t2
           ON t1.appln_id = t2.appln_id
      WHERE t1.appln_filing_year BETWEEN 2010 AND 2015 AND
            t2.appln_id IS NULL AND
            t1.appln_id IN (SELECT DISTINCT appln_id
                            FROM tls209_appln_ipc
                            WHERE ipc_subclass_symbol IN ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J"
                                                         )
                           )
     ) t3
     ON t9.cited_docdb_family_id = t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;
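There is room to optimize this query. First, subqueries in the FROM clause should be used with care in MySQL, because MySQL typically materializes them (it writes their result to a temporary table before joining). You do not need a subquery here; you can just chain the left join operations. Second, select distinct is useless in an in subquery. Also, exists is often faster.
I would start by rewriting this as: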
SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t1.appln_id
FROM docdb_family_citation t9 LEFT JOIN
     tls201_appln t1
     ON t9.cited_docdb_family_id = t1.docdb_family_id AND
        t1.appln_filing_year BETWEEN 2010 AND 2015 AND
        EXISTS (SELECT 1 FROM tls209_appln_ipc t209
                WHERE t209.appln_id = t1.appln_id AND
                      t209.ipc_subclass_symbol IN ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J")
               ) AND
        NOT EXISTS (SELECT 1 FROM tls204_appln_prior t2
                    WHERE t1.appln_id = t2.appln_id
                   )
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;
For this query, you want the following indexes: tls204_appln_prior(appln_id), tls209_appln_ipc(appln_id, ipc_subclass_symbol), and tls201_appln(docdb_family_id, appln_id).
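As a sketch, the indexes could be created as shown below. This assumes a local PATSTAT copy where you have index privileges (the hosted web interface may not allow DDL), and the index names are made up. The tls201_appln index leads with docdb_family_id because that is the column the join to docdb_family_citation uses.

```sql
CREATE INDEX idx_prior_appln        ON tls204_appln_prior (appln_id);
CREATE INDEX idx_ipc_appln_subclass ON tls209_appln_ipc (appln_id, ipc_subclass_symbol);
CREATE INDEX idx_appln_family       ON tls201_appln (docdb_family_id, appln_id);

-- Probe the plan for a single family. Ideally the "key" column of the
-- EXPLAIN output names the indexes created above.
EXPLAIN
SELECT t1.appln_id
FROM tls201_appln t1
WHERE t1.docdb_family_id = 12345   -- arbitrary placeholder family id
  AND NOT EXISTS (SELECT 1
                  FROM tls204_appln_prior t2
                  WHERE t2.appln_id = t1.appln_id);
```

Running EXPLAIN on the full top-15 query the same way shows whether any table is still read with a full scan.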
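I do not like the exists and not exists in the on clause, but that seems to be the semantics you are looking for. I strongly suspect there is a better way to write the query, but your question does not provide enough information. A better approach would be to aggregate the t1 table first and then left join the result to the t9 table. However, the nested left join and exists make that confusing.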
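One way to read "aggregate first" is to reduce tls201_appln to one row per qualifying family before touching the citation table, so the join cannot multiply citation rows. The sketch below rests on assumptions the question does not confirm: that only families with at least one qualifying application should be ranked (so an inner join replaces the left join), and that the per-application appln_id is not needed in the output. The alias fam is made up.

```sql
SELECT t9.cited_docdb_family_id, COUNT(*) AS cited
FROM docdb_family_citation t9
JOIN (
    -- One row per family that has at least one qualifying application.
    SELECT DISTINCT t1.docdb_family_id
    FROM tls201_appln t1
    WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
      AND EXISTS (SELECT 1
                  FROM tls209_appln_ipc t209
                  WHERE t209.appln_id = t1.appln_id
                    AND t209.ipc_subclass_symbol IN ('A61K', 'C07K', 'A61P', 'C12N', 'C07D',
                                                     'C12P', 'C07H', 'C12Q', 'C07J'))
      AND NOT EXISTS (SELECT 1
                      FROM tls204_appln_prior t2
                      WHERE t2.appln_id = t1.appln_id)
) fam ON fam.docdb_family_id = t9.cited_docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;
```

The derived table is still materialized, but it holds only the filtered family ids, and each citation row is counted exactly once.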
With the help of the previous answers, this is the final code that gave the results I was looking for:
SELECT t9.cited_docdb_family_id, t99.cited AS cited, t1.appln_id, t1.appln_nr_epodoc
FROM docdb_family_citation t9
INNER JOIN (SELECT cited_docdb_family_id, COUNT(cited_docdb_family_id) AS cited
            FROM docdb_family_citation
            GROUP BY cited_docdb_family_id) t99
    ON t9.cited_docdb_family_id = t99.cited_docdb_family_id
LEFT JOIN tls201_appln t1
    ON t9.cited_docdb_family_id = t1.docdb_family_id
WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
  AND EXISTS (SELECT 1 FROM tls209_appln_ipc t209
              WHERE t209.appln_id = t1.appln_id
                AND t209.ipc_subclass_symbol IN ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J")
             )
  AND NOT EXISTS (SELECT 1 FROM tls204_appln_prior t2
                  WHERE t1.appln_id = t2.appln_id
                 )
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;
Note that the join with the subquery t99 is used to obtain the correct cited count.
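The reason the count has to come from a separate subquery is presumably that a family can have several applications, so joining citations to applications repeats each citation row once per application. A toy illustration, using made-up temporary tables rather than the PATSTAT schema:

```sql
-- Toy tables (hypothetical names), not the PATSTAT schema.
CREATE TEMPORARY TABLE demo_citation (cited_family INT);
CREATE TEMPORARY TABLE demo_appln (appln_id INT, family INT);

INSERT INTO demo_citation VALUES (1), (1), (2);           -- family 1 cited twice, family 2 once
INSERT INTO demo_appln VALUES (10, 1), (11, 1), (20, 2);  -- family 1 has two applications

-- Counting after the join: family 1 yields 2 citations x 2 applications = 4 rows.
SELECT c.cited_family, COUNT(*) AS cited
FROM demo_citation c
JOIN demo_appln a ON a.family = c.cited_family
GROUP BY c.cited_family
ORDER BY c.cited_family;
-- family 1 -> 4, family 2 -> 1

-- Counting before the join: the count comes from the citation table alone.
SELECT cnt.cited_family, cnt.cited, a.appln_id
FROM (SELECT cited_family, COUNT(*) AS cited
      FROM demo_citation
      GROUP BY cited_family) cnt
JOIN demo_appln a ON a.family = cnt.cited_family
ORDER BY cnt.cited_family, a.appln_id;
-- (1, 2, 10), (1, 2, 11), (2, 1, 20)
```

The second form keeps the true citation count (2 for family 1) no matter how many applications the family has.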