Self-Joining across nested Records in BigQuery

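I'm trying to do some joins/aggregations across nested fields in a single table, and I'm running into both SQL problems and the error "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN".

I'd love SQL help with the general problem, but I'm also curious how to deal with that error.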

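My problem maps onto the BigQuery patents data. In that dataset, a patent has classification data (the cpc record, where cpc.code is one classification code in the record, along with the associated data cpc.inventive and cpc.first). A patent also has the patents it cites (the citation record, where citation.publication_number is the cited patent, with associated data citation.type and citation.category). There are more fields in these records, but assume these are the important ones.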

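What I'd like to get is something like the JSON below: one row per CPC, where the record captures how patents with that CPC cite other patents, broken down by the CPC of the cited patent and by facets of each CPC and of the citation. The JSON would look like this: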

[
  {
      "citing_patent_cpc": "1234/123",
      "cited_patent_cpcs":
      [
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": true,
          "count": 45
        },
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": false,
          "count": 12
        },
        {
          "cpc": "H211/123",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": null,
          "cited_cpc_inventive": true,
          "cited_cpc_first": false,
          "count": 3
        },
        ...

      ]
  },
  {
      "citing_patent_cpc": "1234/ABC",
      "cited_patent_cpcs":
      [
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": true,
          "count": 16
        },
        {
          "cpc": "ABCD/345",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": "A",
          "cited_cpc_inventive": false,
          "cited_cpc_first": false,
          "count": 3
        },
        {
          "cpc": "H211/123",
          "citing_cpc_inventive": true,
          "citing_cpc_first": false,
          "citation_type": "ABC",
          "citation_category": null,
          "cited_cpc_inventive": true,
          "cited_cpc_first": false,
          "count": 9
        },
        ...
      ]
  },
  ...
]

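where each unique cpc.code gets one row and one array. The array records how many times patents with the row's CPC cite patents with a particular CPC ("cpc"), broken down by facets of both patents' CPCs and by the citation type.

For example, the first record in the example above says that, 45 times, a patent with CPC "1234/123" as an inventive CPC but not a first CPC cited another patent with CPC "ABCD/345" as a first CPC but not an inventive CPC, and that the citation was of type "ABC" and category "A". In theory, each row could have a record for every CPC in the corpus * the number of possible facets, but in practice it won't.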

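As a partial step, I tried joining the cpc records of the cited patents onto the records of the citing patents. I got this query working on very small tables declared directly in SQL, but when I try to run it on big data (like the actual patents table) it gives "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."

Here is that query: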

SELECT
  publication_number,
  cpc,
  citation,
  (
    SELECT ARRAY_CONCAT_AGG(cpc)
    FROM `patents-public-data.patents.publications` AS JoinedPatents
    RIGHT JOIN
      (
        SELECT publication_number
        FROM UNNEST(Patents.citation)
      ) AS unnestedcitation
    ON unnestedcitation.publication_number = JoinedPatents.publication_number) AS cited_cpc
FROM `patents-public-data.patents.publications` AS Patents

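I'd love to know:

  1. How to get that query working without the error.
  2. An overall solution to my problem, if anyone is feeling generous with their SQL-fu.

Thanks to anyone who read this far.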

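I think this should be an approximation of the query you want. It looks like a big dataset, so I can't comment on speed/efficiency. Hopefully the logic at least makes sense.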

with data as (
-- unnest your data
  select 
    p.publication_number,
    cp.code as cpc_code,
    cp.inventive as cpc_inventive,
    cp.first as cpc_first,
    ci.publication_number as citation_publication_number,
    ci.type as citation_type,
    ci.category as citation_category
  from `patents-public-data.patents.publications` p
  left join unnest(cpc) cp
  left join unnest(citation) ci
),
joined as (
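-- self-join: match each citation's publication_number to the cited patent's rows, then group to get counts
-- caveat: d2 has one row per (cpc, citation) pair of the cited patent, so count(*) is multiplied by
-- that patent's own number of citations; join against a cpc-only unnest instead if exact counts matter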
  select 
    d1.cpc_code as citing_patent_cpc,
    d2.cpc_code as cpc,
    d1.cpc_inventive as citing_cpc_inventive,
    d1.cpc_first as citing_cpc_first,
    d1.citation_type,
    d1.citation_category,
    d2.cpc_inventive as cited_cpc_inventive,
    d2.cpc_first as cited_cpc_first,
    count(*) as count
  from data d1
  left join data d2 on d1.citation_publication_number = d2.publication_number
  group by 1,2,3,4,5,6,7,8
),
agged as (
-- aggregate to match requested output
  select 
    citing_patent_cpc,
    array_agg(struct(cpc,citing_cpc_inventive,citing_cpc_first,citation_type,citation_category,cited_cpc_inventive,cited_cpc_first,count)) cited_patent_cpcs
  from joined
  group by 1
)
select * from agged
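
On the first question (getting the partial-step query to run without the error), the same unnest-then-join idea can be applied directly. The following is only a sketch, not something I have run against the full table: it assumes publication_number identifies a single row in the publications table, and the names citing, cited and cited_publication_number are made up for illustration. Instead of a subquery that looks up the table once per row, it flattens the citations first, joins them to the cited patents, and then re-aggregates per citing patent.

```sql
-- Sketch: the correlated subquery rewritten as an explicit join.
-- Assumes publication_number is unique per row of the publications table.
WITH citing AS (
  -- One row per (citing patent, cited publication number);
  -- LEFT JOIN keeps patents that have no citations at all.
  SELECT
    p.publication_number,
    c.publication_number AS cited_publication_number
  FROM `patents-public-data.patents.publications` AS p
  LEFT JOIN UNNEST(p.citation) AS c
)
SELECT
  citing.publication_number,
  -- Concatenate the cpc arrays of every cited patent found in the table;
  -- NULL arrays from unmatched citations are ignored by ARRAY_CONCAT_AGG.
  ARRAY_CONCAT_AGG(cited.cpc) AS cited_cpc
FROM citing
LEFT JOIN `patents-public-data.patents.publications` AS cited
  ON citing.cited_publication_number = cited.publication_number
GROUP BY citing.publication_number
```

This drops the cpc and citation columns of the citing patent that the original query selected; if they are needed, the result can be joined back to the publications table on publication_number.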