BigQuery JavaScript UDF fails with "Resources Exceeded"
This question may be another example of BigQuery UDF memory exceeded error on multiple rows but works fine on single row, but it was suggested that I post it as a question rather than an answer.

I'm using JavaScript to parse a log file into a table. The JavaScript parsing function is relatively simple. It works on 1M rows, but fails on 3M. The log files can be a lot bigger than 3M, so the failure is a problem.
The function is as follows:
function parseLogRow(row, emit) {
  // Reassemble the three input columns into a single log line.
  var r = (row.logrow ? row.logrow : "") +
          (row.l2 ? " " + row.l2 : "") +
          (row.l3 ? " " + row.l3 : "");
  var ts = null;
  var category = null;
  var user = null;
  var message = null;
  var db = null;
  var seconds = null;
  var found = false;
  if (r) {
    var m = r.match(/^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] ::( \(([\d\.]+)s\))? (.*)/);
    if (m) {
      ts = new Date(m[1]) * 1;   // epoch milliseconds
      category = m[3] || null;
      user = m[4] || null;
      db = m[5] || null;
      seconds = m[7] || null;
      message = m[8] || null;
      found = true;
    } else {
      message = r;
      found = false;
    }
  }
  emit({
    ts: ts,
    category: category,
    user: user,
    db: db,
    seconds: seconds * 1.0,
    message: message,
    found: found
  });
}
bigquery.defineFunction(
  'parseLogRow',            // Name of the function exported to SQL
  ['logrow', 'l2', 'l3'],   // Names of input columns
  [                         // Output schema
    {'name': 'ts',       'type': 'float'},
    {'name': 'category', 'type': 'string'},
    {'name': 'user',     'type': 'string'},
    {'name': 'db',       'type': 'string'},
    {'name': 'seconds',  'type': 'float'},
    {'name': 'message',  'type': 'string'},
    {'name': 'found',    'type': 'boolean'}
  ],
  parseLogRow               // Reference to the JavaScript UDF above
);
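As a quick way to exercise the parsing logic outside BigQuery, the function can be run directly in Node.js with a stubbed emit callback. This is just a local sanity check alongside the function definition above, not part of the UDF; the sample line is taken from the data below, and the printed ts value depends on how the JS engine parses the "-0800"-style offset:

// Local sanity check (Node.js): feed parseLogRow one sample row
// and print whatever it emits.
var sampleRow = {
  logrow: "2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true)",
  l2: "",
  l3: ""
};
parseLogRow(sampleRow, function (record) {
  console.log(JSON.stringify(record));
});
// Expect found: true, category: "DEBUG", user: "7aaa2", db: "scheduler".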
I'm referencing the function with this query:
SELECT
  ROW_NUMBER() OVER() AS row_num,
  ts, category, user,
  db, seconds, message, found,
FROM parseLogRow((
  SELECT * FROM [#{dataset}.today]
  LIMIT 1000000
))
Some sample data from the 'today' table looks like this (as CSV):
logrow,l2,l3
# Logfile created on 2015-12-29 00:00:09 -0800 by logger.rb/v1.2.7,,
2015-12-29 00:00:09.262 -0800 [INFO|7aaa0|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:09.277 -0800 [DEBUG|7aaa0|] :: Restarted gulp process,,
2015-12-29 00:00:09.278 -0800 [INFO|7aaa0|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:19.396 -0800 [INFO|7aaa4|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:19.409 -0800 [DEBUG|7aaa4|] :: Restarted gulp process,,
2015-12-29 00:00:19.410 -0800 [INFO|7aaa4|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.487 -0800 [INFO|7aaa6|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:29.500 -0800 [DEBUG|7aaa6|] :: Restarted gulp process,,
2015-12-29 00:00:29.500 -0800 [INFO|7aaa6|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.597 -0800 [INFO|7aaa8|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:39.610 -0800 [DEBUG|7aaa8|] :: Restarted gulp process,,
2015-12-29 00:00:39.611 -0800 [INFO|7aaa8|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:44.659 -0800 [DEBUG|7aaaa|scheduler] :: Polling for pending tasks (master: true),,
2015-12-29 00:00:49.687 -0800 [INFO|7aaac|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:49.689 -0800 [DEBUG|7aaac|] :: Restarted gulp process,,
2015-12-29 00:00:49.689 -0800 [INFO|7aaac|] :: Completed scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.869 -0800 [INFO|7aaae|] :: Running scheduled job: confirm running gulp process,,
2015-12-29 00:00:59.871 -0800 [DEBUG|7aaae|] :: Restarted gulp process,,
This is a bit of a hack: I import the log as a 3-column table (really, just one column) by loading it as CSV with the delimiter set to tab (our log files generally don't contain any tabs), then use a query to transform it into the table I actually want.
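For reference, a load along those lines with the bq CLI might look like this (a sketch only; the dataset, table, and file names are placeholders, not taken from the original job):

bq load --source_format=CSV --field_delimiter=tab \
    mydataset.today ./app.log logrow:string,l2:string,l3:string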
I like this pattern because the parsing is fast and distributed (when it works).
The failing job was: bigquery-looker:bquijob_260be029_153dd96cfdb. Contact me if you need a reproducible test case.
Any help or suggestions would be appreciated.
Point 1
I don't see a problem with the UDF itself; it worked for me even on 10 million rows.
I think the problem is the use of ROW_NUMBER() OVER(): with no PARTITION BY, numbering the rows forces them all through a single worker, which is a classic cause of "Resources exceeded". Remove it and it should work:
SELECT
  ts, category, user,
  db, seconds, message, found,
FROM parseLogRow((
  SELECT * FROM [#{dataset}.today]
))
Point 2
From a performance point of view, the query below should run faster (I think), and in general I'd recommend avoiding UDFs in cases where "plain" BigQuery SQL does the job just as well:
SELECT
  REGEXP_EXTRACT(logrow, r'^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[[^|]*\|[^|]*\|[^\]]*\] :: .*') AS ts,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[([^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS category,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|([^|]*)\|(?:[^\]]*)\] :: (?:.*)') AS user,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|([^\]]*)\] :: (?:.*)') AS db,
  REGEXP_EXTRACT(logrow, r'^(?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (.*)') AS message,
  REGEXP_MATCH(logrow, r'^((?:\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[(?:[^|]*)\|(?:[^|]*)\|(?:[^\]]*)\] :: (?:.*))') AS found
FROM (
  SELECT CONCAT(logrow, IFNULL(CONCAT(' ', l2), ''), IFNULL(CONCAT(' ', l3), '')) AS logrow
  FROM YourTable
)
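To sanity-check the shared pattern before running it over a large table, the same regular expression can be tried in JavaScript on one of the sample lines. This is a sketch with all five fields captured in a single pattern, rather than one capture group per REGEXP_EXTRACT call as above:

var pattern = /^(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d\.\d\d\d (?:\+|\-)\d\d\d\d) \[([^|]*)\|([^|]*)\|([^\]]*)\] :: (.*)/;
var line = "2015-12-29 00:00:14.343 -0800 [DEBUG|7aaa2|scheduler] :: Polling for pending tasks (master: true)";
var m = line.match(pattern);
// m[1] = timestamp, m[2] = category, m[3] = user, m[4] = db, m[5] = message
console.log(m && m.slice(1));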