How to extract the table name from a CREATE/UPDATE/INSERT statement in an SQL query?
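I'm trying to parse out the table that is created, inserted into, or updated by SQL queries stored in a table column. Let's call that column query. Here is some sample data showing how the data can vary.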
with sample_data as (
select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
select 2 as id, 'CREATE OR REPLACE TABLE tbl1 ...' as query union all
select 3 as id, 'DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...' as query union all
select 4 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 5 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 6 as id, 'UPDATE tbl3 SET col1 = ...' as query union all
select 7 as id, '/*some garbage comments*/ UPDATE tbl3 SET col1 = ...' as query union all
select 8 as id, 'DELETE tbl4 ...' as query
),
Here are the formats of the queries (we're trying to extract table_name):
#1
some optional statements like drop table
CREATE some comments or optional statement like OR REPLACE
TABLE table_name
everything else
#2
some optional statements like drop table
INSERT some comments
INTO some comments
table_name
#3
some optional statements like drop table
UPDATE some comments
table_name
everything else
I think this really depends on your data, but you may have some success with something like this:
with data as (
  select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
  select 2 as id, 'INSERT INTO tbl2 ...' as query union all
  select 3 as id, 'UPDATE tbl3 ...' as query union all
  select 4 as id, 'DELETE tbl4 ...' as query
),
-- split each query into space-separated tokens
splitted as (
  select id, split(query, ' ') as query_parts from data
)
select
  id,
  case
    -- CREATE TABLE <name> / INSERT INTO <name>: the name is the third token
    when query_parts[safe_offset(0)] in ('CREATE', 'INSERT') then query_parts[safe_offset(2)]
    -- UPDATE <name> / DELETE <name>: the name is the second token
    when query_parts[safe_offset(0)] in ('UPDATE', 'DELETE') then query_parts[safe_offset(1)]
    else 'Error'
  end as table_name
from splitted
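Of course, this depends on how tidy and how consistent the syntax in your query column is. Also, if your table_name is fully qualified as project.dataset.table, you will need to split it further.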
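As a rough sketch of that further split, here is the idea in Python rather than BigQuery SQL (an assumption, made only so the snippet can be run anywhere). The helper name split_qualified_name is hypothetical, and it assumes a name has at most three dot-separated parts, optionally wrapped in backticks.

```python
def split_qualified_name(name):
    """Split 'project.dataset.table' (or a shorter form) into its parts."""
    # Drop BigQuery-style backticks, then split on the dots.
    parts = name.replace("`", "").split(".")
    if len(parts) > 3:
        raise ValueError(f"unexpected table name: {name!r}")
    # Right-align the parts so that the last one is always the table.
    padded = [None] * (3 - len(parts)) + parts
    return dict(zip(("project", "dataset", "table"), padded))


print(split_qualified_name("my_project.my_dataset.tbl1"))
print(split_qualified_name("`my_dataset.tbl2`"))
print(split_qualified_name("tbl3"))
```

Missing leading parts come back as None, so an unqualified name still yields a usable table entry.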
Regex
To build a suitable regex, let's start with the following relatively simple/readable version:
((CREATE( OR REPLACE)?|DROP) TABLE( IF EXISTS)?|UPDATE|DELETE|INSERT INTO) ([^\s\/*]+)
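Every space above can be replaced by "at least one whitespace character", i.e. \s+. But we also need to allow for comments. For a comment that looks like /*anything*/, the regex looks like \/\*.*\*\/ (where the comment characters are escaped with \ and the "anything" is the .* in the middle). Given that there may be several such comments, optionally separated by whitespace, we end up with (\s*\/\*.*\*\/\s*?)*\s+. Inserting this wherever there is a space gives: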
((CREATE((\s*\/\*.*\*\/\s*?)*\s+OR(\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(\s*\/\*.*\*\/\s*?)*\s+TABLE((\s*\/\*.*\*\/\s*?)*\s+IF(\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(\s*\/\*.*\*\/\s*?)*\s+INTO)(\s*\/\*.*\*\/\s*?)*\s+([^\s\/*]+)
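One thing worth checking with this separator is the greedy .* inside the comment part: if a second comment appears later in the same statement, .* can run from the first /* to the last */. The sketch below illustrates this using Python's re module as a stand-in for BigQuery's regex engine (an assumption). The UPDATE-only pattern, the sample strings, and the lazy .*? variant are hypothetical and not part of the pattern above.

```python
import re

# Named group for the table so that group numbering does not matter here.
TABLE = r"(?P<table>[^\s\/*]+)"
# Separator as written above: greedy .* inside the comment.
GREEDY_SEP = r"(\s*\/\*.*\*\/\s*?)*\s+"
# Hypothetical variant: lazy .*? stops at the first closing */.
LAZY_SEP = r"(\s*\/\*.*?\*\/\s*?)*\s+"

greedy = re.compile("UPDATE" + GREEDY_SEP + TABLE)
lazy = re.compile("UPDATE" + LAZY_SEP + TABLE)

one_comment = "UPDATE /*c1*/ tbl3 SET col1 = 1"
two_comments = "UPDATE /*c1*/ tbl3 SET col1 = 1 /*c2*/ WHERE col2 = 2"

for name, rx in (("greedy", greedy), ("lazy", lazy)):
    first = rx.search(one_comment).group("table")
    second = rx.search(two_comments).group("table")
    print(name, first, second)
# greedy tbl3 WHERE
# lazy tbl3 tbl3
```

With a single comment both variants agree. With a trailing second comment, the greedy version skips past the real table name, so the lazy form may be the safer choice if your queries can contain more than one comment.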
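One further refinement is needed: parenthesized expressions have been used for alternation, e.g. (CHOICE1|CHOICE2). But this syntax also turns them into capturing groups. We actually want just one capturing group, for the table name, so we can make all the others non-capturing with ?:, e.g. (?:CHOICE1|CHOICE2). This gives: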
(?:(?:CREATE(?:(?:\s*\/\*.*\*\/\s*?)*\s+OR(?:\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(?:\s*\/\*.*\*\/\s*?)*\s+TABLE(?:(?:\s*\/\*.*\*\/\s*?)*\s+IF(?:\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(?:\s*\/\*.*\*\/\s*?)*\s+INTO)(?:\s*\/\*.*\*\/\s*?)*\s+([^\s\/*]+)
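To see what the ?: change does to group numbering, here is a small check on the simple/readable version of the pattern. It uses Python's re module purely for illustration (an assumption; the target engine is BigQuery's), and the variable names are hypothetical.

```python
import re

# The simple version from earlier, with every parenthesis capturing.
capturing = re.compile(
    r"((CREATE( OR REPLACE)?|DROP) TABLE( IF EXISTS)?|UPDATE|DELETE|INSERT INTO) ([^\s\/*]+)"
)
# The same pattern with all but the table-name group made non-capturing.
non_capturing = re.compile(
    r"(?:(?:CREATE(?: OR REPLACE)?|DROP) TABLE(?: IF EXISTS)?|UPDATE|DELETE|INSERT INTO) ([^\s\/*]+)"
)

query = "CREATE OR REPLACE TABLE tbl1 ..."

print(capturing.groups)                       # 5
print(capturing.search(query).group(1))       # CREATE OR REPLACE TABLE
print(capturing.search(query).group(5))       # tbl1
print(non_capturing.groups)                   # 1
print(non_capturing.search(query).group(1))   # tbl1
```

In the capturing version the table name is the fifth group. After the change it is the only group, which is what a function that returns "the" captured substring needs.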
Online regex demo
Here is a demo using your examples: Regex101 demo
SQL
The Google BigQuery documentation for REGEXP_EXTRACT says that it returns the substring matched by the capturing group. So I'd expect something like this to work:
with sample_data as (
select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
select 2 as id, 'CREATE OR REPLACE TABLE tbl1 ...' as query union all
select 3 as id, 'DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...' as query union all
select 4 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 5 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 6 as id, 'UPDATE tbl3 SET col1 = ...' as query union all
select 7 as id, '/*some garbage comments*/ UPDATE tbl3 SET col1 = ...' as query union all
select 8 as id, 'DELETE tbl4 ...' as query
)
SELECT
*, REGEXP_EXTRACT(query, r"(?:(?:CREATE(?:(?:\s*\/\*.*\*\/\s*?)*\s+OR(?:\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(?:\s*\/\*.*\*\/\s*?)*\s+TABLE(?:(?:\s*\/\*.*\*\/\s*?)*\s+IF(?:\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(?:\s*\/\*.*\*\/\s*?)*\s+INTO)(?:\s*\/\*.*\*\/\s*?)*\s+([^\s\/*]+)") AS table_name
FROM sample_data;
(The above is untested, so let me know in the comments if there are any problems.)
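For a quick local check without BigQuery, the same pattern can be run over the sample rows in Python. This is only a sketch: Python's re module stands in for BigQuery's regex engine (an assumption, so results there may differ), and the names SEP, PATTERN and extract_table are hypothetical.

```python
import re

# Comment-tolerant separator, exactly as in the final pattern above.
SEP = r"(?:\s*\/\*.*\*\/\s*?)*\s+"
PATTERN = re.compile(
    rf"(?:(?:CREATE(?:{SEP}OR{SEP}REPLACE)?|DROP){SEP}TABLE(?:{SEP}IF{SEP}EXISTS)?"
    rf"|UPDATE|DELETE|INSERT{SEP}INTO){SEP}([^\s\/*]+)"
)

sample_data = [
    "CREATE TABLE tbl1 ...",
    "CREATE OR REPLACE TABLE tbl1 ...",
    "DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...",
    "INSERT /*some comment*/ INTO tbl2 ...",
    "INSERT /*some comment*/ INTO tbl2 ...",
    "UPDATE tbl3 SET col1 = ...",
    "/*some garbage comments*/ UPDATE tbl3 SET col1 = ...",
    "DELETE tbl4 ...",
]


def extract_table(query):
    """Return the first captured table name, or None if nothing matches."""
    match = PATTERN.search(query)
    return match.group(1) if match else None


results = [extract_table(q) for q in sample_data]
for row_id, table in enumerate(results, start=1):
    print(row_id, table)
```

In this Python run, row 3 comes back as tbl1; with the trailing semicolon, because the first match is the DROP TABLE statement and ; is not excluded by [^\s\/*]+. Adding ; to that character class, or trimming it afterwards, would be one way to handle that.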