How to extract a string with SYMBOLS after a pattern in a URL string in Google BigQuery

Question

I have two possible forms of URL string:

http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];
http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;

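I want to extract Y/lyPw== from both examples,

that is, everything before ;id_1, inside the brackets when they are present.

It always comes right after the ?pps= part.

What is the best way to do this? I'd like to use the BigQuery query language, since that is where my data lives.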

Answer 1

This regex should work for you:

(\w+);id_1

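It will extract XYZXYZ.

It uses the concept of a capture group. Note that \w matches only letters, digits, and underscores, so this pattern will not match a value that contains / or =, such as Y/lyPw==.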

See this Demo
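
To make the capture-group idea concrete, here is a minimal sketch that runs the same pattern locally. It uses Python's `re` module as a stand-in for BigQuery's regex engine, on the assumption that this simple pattern behaves the same in both; the helper name `extract_before_id1` is made up for illustration. The first URL is the alphanumeric sample used in the other answers, the second is from the question.

```python
import re

# Pattern from the answer: the parentheses form a capture group, so only
# the word characters directly before ";id_1" are returned.
ID1_PATTERN = re.compile(r"(\w+);id_1")


def extract_before_id1(url):
    """Return the captured group, or None when the pattern does not match."""
    match = ID1_PATTERN.search(url)
    return match.group(1) if match else None


# Sample URL with a purely alphanumeric value (taken from the other answers).
plain = "http://www.abcexample.com/landpage/?pps=;[XYZXYZ;id_1][XYZZZZ;id_2];[5403;ord];"
# URL from the question, whose value contains "/" and "=".
symbols = "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];"

print(extract_before_id1(plain))    # XYZXYZ
print(extract_before_id1(symbols))  # None, because "=" is not matched by \w
```

So the pattern is fine for plain alphanumeric values, but it needs a different character class or anchor once symbols are involved.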

Answer 2

Here is one way to build the regular expression:

SELECT REGEXP_EXTRACT(url, r'\?pps=;[\[]?([^;]*);') FROM
(SELECT "http://www.abcexample.com/landpage/?pps=;[XYZXYZ;id_1][XYZZZZ;id_2];[5403;ord];" 
  AS url),
(SELECT "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;"
  AS url)
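
If you want to check this pattern before running the query, a rough local sketch like the one below may help. It uses Python's `re` rather than BigQuery itself (an assumption that the two agree for this pattern), and `extract_pps` is a hypothetical helper name. Note that the pattern expects a `;` immediately after `pps=`, as in the sample URLs above, so a URL without that semicolon does not match.

```python
import re

# Same pattern as in the query: literal "?pps=;", an optional "[",
# then capture everything up to the next ";".
PPS_PATTERN = re.compile(r"\?pps=;[\[]?([^;]*);")


def extract_pps(url):
    """Return the captured value, or None when the pattern does not match."""
    match = PPS_PATTERN.search(url)
    return match.group(1) if match else None


# The two sample URLs used in the query above.
sample_urls = [
    "http://www.abcexample.com/landpage/?pps=;[XYZXYZ;id_1][XYZZZZ;id_2];[5403;ord];",
    "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;",
]
results = [extract_pps(u) for u in sample_urls]
print(results)  # ['XYZXYZ', 'XYZXYZ']

# A URL in the question's form has no ";" right after "pps=".
question_url = "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;"
print(extract_pps(question_url))  # None
```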

Answer 3

You can use this regex:

pps=\[?([^;]+)

Working demo

The idea behind this regex is:

pps=    -> Look for the pps= pattern
\[?     -> might have a [ or not
([^;]+) -> store the content up to the first semi colon
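
The breakdown above can be tried out locally with a small sketch like this one. It is only an approximation: it uses Python's `re` instead of BigQuery (assuming the pattern behaves the same there), and `first_pps_value` is a hypothetical helper name. The URLs are the two from the question.

```python
import re

# pps=     literal text to look for
# \[?      an optional opening bracket
# ([^;]+)  capture one or more characters up to the first semicolon
PPS_VALUE = re.compile(r"pps=\[?([^;]+)")


def first_pps_value(url):
    """Return the text captured after pps=, or None when nothing matches."""
    match = PPS_VALUE.search(url)
    return match.group(1) if match else None


# The two URL forms from the question.
question_urls = [
    "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];",
    "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;",
]
values = [first_pps_value(u) for u in question_urls]
print(values)  # ['Y/lyPw==', 'Y/lyPw==']

# [^;]+ needs at least one non-semicolon character, so a value that starts
# with ";" directly after pps= produces no match.
leading_semicolon = "http://www.abcexample.com/landpage/?pps=;XYZXYZ;id_1;unknown;ord;"
print(first_pps_value(leading_semicolon))  # None
```

Because the class `[^;]` excludes only the semicolon, symbols such as `/` and `=` are captured along with letters and digits.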

So, for your URLs, this regex matches pps=[Y/lyPw== (or pps=Y/lyPw== in the form without brackets) and captures Y/lyPw==.

For BigQuery you have to use

REGEXP_EXTRACT('str', 'reg_exp')

Quoting its documentation:

REGEXP_EXTRACT: Returns the portion of str that matches the capturing group within the regular expression.

You would use it in a query like this:

SELECT
   REGEXP_EXTRACT(word,r'pps=\[?([^;]+)') AS fragment
FROM
   ...

For working example code, you can use:

-- Sample URLs from the question; the comma between subqueries is legacy SQL's UNION ALL.
SELECT
   REGEXP_EXTRACT(url,r'pps=\[?([^;]+)') AS fragment
FROM
(SELECT "http://www.abcexample.com/landpage/?pps=[Y/lyPw==;id_1][Y/lyP2ZZYxi==;id_2];[5403;ord];"
  AS url),
(SELECT "http://www.abcexample.com/landpage/?pps=Y/lyPw==;id_1;unknown;ord;"
  AS url)

Tags: regex, string, pattern-matching, google-bigquery