How to store the result of a query on the current table without changing the table schema?

Question

I have the following structure:

  {
    id: "123",
    scans:[{
       "scanid":"123",
       "status":"sleep"
      }]
  },
  {
    id: "123",
    scans:[{
       "scanid":"123",
       "status":"sleep"
      }]
  }

The query to remove duplicates:

    SELECT *
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id) AS row_number
      FROM table1
    )
    WHERE row_number = 1

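I specified table1 as the destination table.

Here I defined scans as a repeated record, with scanid as a string and status as a string. But when I run a query (I am querying to remove duplicates) and overwrite the existing table, the table schema changes: it becomes scans_scanid (string) and scans_status (string). The schema of the scans record is no longer what it was. Where am I going wrong?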

Answer 1

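1) If you run your query in the web UI, the result is automatically flattened, which is why you see that the schema has changed.

You need to run your query and write to a destination table; you can choose to do this in the web UI as well.

2) If you are not running your query in the web UI but still see that the schema has changed, you should make explicit selections so that the schema is preserved for you, for example: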

select 'foo' as scans.scanid

This creates a record-like output for you, but it will not be a repeated record; read on.

3) For some use cases you may need the NEST(expr) function, which:

Aggregates all values in the current aggregation scope into a repeated field. For example, the query "SELECT x, NEST(y) FROM ... GROUP BY x" returns one output record for each distinct x value, and contains a repeated field for all y values paired with x in the query input. The NEST function requires a GROUP BY clause.

BigQuery automatically flattens query results, so if you use the NEST function on the top level query, the results won't contain repeated fields. Use the NEST function when using a subselect that produces intermediate results for immediate use by the same query.
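
The two quoted paragraphs can be mimicked locally to make the behaviour easier to picture. The following is only a minimal JavaScript sketch, not BigQuery code: `nestBy` and `flattenField` are hypothetical helpers, and the input rows are made-up sample data. It shows NEST with GROUP BY collecting values into a repeated field, and flattening at the top level turning that field back into one row per value.

```javascript
// Hypothetical stand-in for "SELECT x, NEST(y) ... GROUP BY x":
// collects every y seen for a given x into one array.
function nestBy(rows, keyField, valueField) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row[keyField])) groups.set(row[keyField], []);
    groups.get(row[keyField]).push(row[valueField]);
  }
  return [...groups].map(([key, values]) => ({
    [keyField]: key,
    [valueField]: values,
  }));
}

// Hypothetical stand-in for automatic flattening of top-level results:
// each element of the repeated field becomes its own output row again.
function flattenField(rows, repeatedField) {
  return rows.flatMap((row) =>
    row[repeatedField].map((value) => ({ ...row, [repeatedField]: value }))
  );
}

// Made-up sample input.
const input = [
  { x: "a", y: 1 },
  { x: "a", y: 2 },
  { x: "b", y: 3 },
];

const nested = nestBy(input, "x", "y");      // one row per distinct x
const flattened = flattenField(nested, "y"); // repeated field is gone again

console.log(JSON.stringify(nested));
console.log(JSON.stringify(flattened));
```

After the flattening step the output has the same shape as the input, which is the effect described above for NEST used in a top-level query. For a repeated record such as scans, the question reports the same effect surfacing as the columns scans_scanid and scans_status.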

Answer 2

NEST() is known to be incompatible with unflattened result output and is mostly used for intermediate results in subqueries.

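Try the workaround below.
Note that I use INTEGER for id and scanid. If they should be STRING, you need to:
a. change them in the output schema section
as well as
b. remove the use of the parseInt() function in t = {scanid:parseInt(x[0]), status:x[1]}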

SELECT id, scans.scanid, scans.status 
FROM JS(
  (      // input table
    SELECT id, NEST(CONCAT(STRING(scanid), ',', STRING(status))) AS scans
    FROM (
      SELECT id, scans.scanid, scans.status 
      FROM (
        SELECT id, scans.scanid, scans.status, 
               ROW_NUMBER() OVER (PARTITION BY id) AS dup
        FROM table1
      ) WHERE dup = 1  
    ) GROUP BY id
  ),
  id, scans,     // input columns
  "[{'name': 'id', 'type': 'INTEGER'},    // output schema
    {'name': 'scans', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'scanid', 'type': 'INTEGER'},
       {'name': 'status', 'type': 'STRING'}
     ]    
    }
  ]",
  "function(row, emit){    // function 
    var c = [];
    for (var i = 0; i < row.scans.length; i++) {
      x = row.scans[i].toString().split(',');
      t = {scanid:parseInt(x[0]), status:x[1]}
      c.push(t);
    };
    emit({id: row.id, scans: c});  
  }"
)
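
The JavaScript function in the query can be tried outside BigQuery before running it as a UDF. The following is a minimal sketch under stated assumptions: `udf` is a local copy of the function body above (with `var` declarations added for `x` and `t`, and an explicit radix for parseInt), `sampleRow` is a hypothetical input row shaped like what the inner query would produce, and the `emit` callback is a local stand-in that just collects rows in an array.

```javascript
// Local copy of the UDF passed to JS() above, with declarations
// added for x and t so it also runs in strict mode.
function udf(row, emit) {
  var c = [];
  for (var i = 0; i < row.scans.length; i++) {
    // Each element is a "scanid,status" string built by CONCAT in the query.
    var x = row.scans[i].toString().split(',');
    var t = { scanid: parseInt(x[0], 10), status: x[1] };
    c.push(t);
  }
  emit({ id: row.id, scans: c });
}

// Hypothetical input row (made-up values), shaped like the output of
// NEST(CONCAT(STRING(scanid), ',', STRING(status))) grouped by id.
var sampleRow = { id: 123, scans: ['123,sleep', '124,awake'] };

// Stand-in for BigQuery's emit: collect rows locally instead.
var emitted = [];
udf(sampleRow, function (out) { emitted.push(out); });

console.log(JSON.stringify(emitted));
```

Because the function splits each packed value on ',', this approach assumes that status never contains a comma. For the STRING variant mentioned above, the local copy would use x[0] directly instead of parseInt(x[0], 10).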

Here I use BigQuery User-Defined Functions. They are extremely powerful but still have some limits and limitations to be aware of. Also keep in mind that queries using them are likely candidates for being classified as expensive High-Compute queries:

Complex queries can consume extraordinarily large computing resources relative to the number of bytes processed. Typically, such queries contain a very large number of JOIN or CROSS JOIN clauses or complex User-defined Functions.
