Self-Joining across nested Records in BigQuery
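I'm trying to do some joins/aggregations between nested fields within a single table, and I'm running into both SQL problems and the error "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
I'd love SQL help with the general problem, but I'm also curious how to deal with that error.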
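My problem maps onto the BigQuery patents data. In that dataset, a patent has classification data (the cpc record, where cpc.code is one classification code in the record, with associated data cpc.inventive and cpc.first). A patent also has the patents it cites (the citation record, where citation.publication_number is the cited patent, with associated data citation.type and citation.category). There are more fields in these records, but assume these are the important ones.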
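What I'd like to get is something like the json below, with one row per CPC, where each record captures how patents with that CPC cite other patents, in terms of the CPC of the cited patent and aspects of the CPCs and the citation. The json would look like this: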
[
{
"citing_patent_cpc": "1234/123",
"cited_patent_cpcs":
[
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": true,
"count": 45
},
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": false,
"count": 12
},
{
"cpc": "H211/123",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": null,
"cited_cpc_inventive": true,
"cited_cpc_first": false,
"count": 3
},
...
]
},
{
"citing_patent_cpc": "1234/ABC",
"cited_patent_cpcs":
[
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": true,
"count": 16
},
{
"cpc": "ABCD/345",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": "A",
"cited_cpc_inventive": false,
"cited_cpc_first": false,
"count": 3
},
{
"cpc": "H211/123",
"citing_cpc_inventive": true,
"citing_cpc_first": false,
"citation_type": "ABC",
"citation_category": null,
"cited_cpc_inventive": true,
"cited_cpc_first": false,
"count": 9
},
...
]
},
...
]
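where each unique cpc.code gets one row and an array. The array records how many times patents with the row's CPC cite patents with a particular CPC ("cpc"), along with aspects of both patents' CPCs and the citation type.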
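For example, the first record in the example above says that 45 times a patent with CPC "1234/123" as an inventive CPC but not its first CPC cited another patent with CPC "ABCD/345" as its first CPC but not an inventive CPC, and that the citation was of type "ABC" and category "A". In theory, each row could hold a record for every CPC in the corpus * the number of possible aspect combinations, but in practice it won't.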
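As a partial step, I tried to join the cpc records of the cited patents onto the record of the citing patent. I got this query working on a very small table declared directly in the SQL, but when I try to run it on big data (like the actual patents table) it gives "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
Here is that query: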
SELECT
publication_number,
cpc,
citation,
(
SELECT ARRAY_CONCAT_AGG(cpc)
FROM `patents-public-data.patents.publications` AS JoinedPatents
RIGHT JOIN
(
SELECT publication_number
FROM UNNEST(Patents.citation)
) AS unnestedcitation
ON unnestedcitation.publication_number = JoinedPatents.publication_number) AS cited_cpc
FROM `patents-public-data.patents.publications` AS Patents
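I'd love to know:
- How to get that query to work without that error.
- An overall solution to my problem, if anyone is feeling generous with their SQL-fu.
Thanks to everyone who read this far.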
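I think this should be an approximation of the query you want. It looks like a large dataset, so I can't comment on speed/efficiency. Hopefully the logic at least makes sense.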
with data as (
-- unnest your data
select
p.publication_number,
cp.code as cpc_code,
cp.inventive as cpc_inventive,
cp.first as cpc_first,
ci.publication_number as citation_publication_number,
ci.type as citation_type,
ci.category as citation_category
from `patents-public-data.patents.publications` p
left join unnest(cpc) cp
left join unnest(citation) ci
),
joined as (
-- do a self-join to join citation publication_number to original publication_number, group to get counts
select
d1.cpc_code as citing_patent_cpc,
d2.cpc_code as cpc,
d1.cpc_inventive as citing_cpc_inventive,
d1.cpc_first as citing_cpc_first,
d1.citation_type,
d1.citation_category,
d2.cpc_inventive as cited_cpc_inventive,
d2.cpc_first as cited_cpc_first,
count(*) as count
from data d1
left join data d2 on d1.citation_publication_number = d2.publication_number
group by 1,2,3,4,5,6,7,8
),
agged as (
-- aggregate to match requested output
select
citing_patent_cpc,
array_agg(struct(cpc,citing_cpc_inventive,citing_cpc_first,citation_type,citation_category,cited_cpc_inventive,cited_cpc_first,count)) cited_patent_cpcs
from joined
group by 1
)
select * from agged
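To sanity-check the counting logic without scanning the real table, here is a minimal sketch that runs the same self-join and group-by on a toy table using Python's built-in sqlite3 module. It is not BigQuery: SQLite has no UNNEST, so the toy table is assumed to be already flattened the way the data CTE would leave it, and the publication numbers and CPC codes (P1, A01/1, and so on) are made up for illustration. The final nesting step is done in Python rather than with array_agg(struct(...)).

```python
import sqlite3
from collections import defaultdict

# Toy stand-in for the flattened `data` CTE: one row per (patent, cpc, citation).
# All values are hypothetical; booleans are stored as 0/1 because SQLite has no BOOL type.
rows = [
    # publication_number, cpc_code, cpc_inventive, cpc_first,
    # citation_publication_number, citation_type, citation_category
    ("P1", "A01/1", 1, 1, "P2", "ABC", "A"),
    ("P1", "A01/1", 1, 1, "P3", "ABC", "A"),
    ("P2", "B02/2", 0, 1, None, None, None),
    ("P3", "B02/2", 0, 1, None, None, None),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE data (
        publication_number TEXT, cpc_code TEXT,
        cpc_inventive INTEGER, cpc_first INTEGER,
        citation_publication_number TEXT,
        citation_type TEXT, citation_category TEXT)"""
)
conn.executemany("INSERT INTO data VALUES (?,?,?,?,?,?,?)", rows)

# Same shape as the `joined` CTE: self-join on the cited publication number,
# then count rows per combination of citing/cited CPC attributes.
joined = conn.execute(
    """
    SELECT d1.cpc_code       AS citing_patent_cpc,
           d2.cpc_code       AS cpc,
           d1.cpc_inventive  AS citing_cpc_inventive,
           d1.cpc_first      AS citing_cpc_first,
           d1.citation_type,
           d1.citation_category,
           d2.cpc_inventive  AS cited_cpc_inventive,
           d2.cpc_first      AS cited_cpc_first,
           COUNT(*)          AS n
    FROM data d1
    LEFT JOIN data d2
      ON d1.citation_publication_number = d2.publication_number
    GROUP BY d1.cpc_code, d2.cpc_code, d1.cpc_inventive, d1.cpc_first,
             d1.citation_type, d1.citation_category,
             d2.cpc_inventive, d2.cpc_first
    """
).fetchall()

# Rough equivalent of the `agged` CTE: one entry per citing CPC holding its detail rows.
nested = defaultdict(list)
for citing_cpc, *detail in joined:
    nested[citing_cpc].append(tuple(detail))

for citing_cpc, details in nested.items():
    print(citing_cpc, details)
```

On this toy data, the rows for P2 and P3 (which cite nothing) survive the LEFT JOIN and form a group whose cited cpc is NULL. If you only want rows where a citation actually matched, an inner join in the joined step would drop them.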