'Immediate Follow' Page Path in BigQuery

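I am working in BigQuery to figure out how many users completed a specific page path (at any point in a session). Let's say the page path is Page 1 -> Page 2 -> Page 3. The pages must be visited in sequence, each one immediately following the previous one. I can use BQ to build the page path, but this method only identifies users who hit those pages at some point in the session, even with other pages in between. For example, it also counts Page 1 -> Page 456 -> Page 2.

Any ideas?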

(SELECT [date]
, CASE WHEN pages like '/Page1' then fullVisitorId end as [users]
, CASE WHEN pages like '/Page1>>/Page2' then fullVisitorId end as [path_users_2]
, CASE WHEN pages like '/Page1>>/Page2>>/Page3' then fullVisitorId end as [path_users_3]
, [path_type]
, [path]
, [product]
, [device.deviceCategory]
FROM

  ( SELECT [date]
    , [fullVisitorId]
    , [visitId]
    , [visitNumber]
    , group_concat(hits.page.pagePath,'>>') as [pages]
    , 'New Pages' as [path_type]
    , 'Upgrade' as [path]
    , 'Professional' as [product]
  FROM
      (
      TABLE_DATE_RANGE
          ( [XXXXXX.ga_sessions_]
          , TIMESTAMP('2014-06-01')
          , TIMESTAMP('2014-06-05') )
      )
  where
  (REGEXP_MATCH(hits.page.pagePath,r'^/Page1($|/$|\?|/\?|%3F)'))
  or (REGEXP_MATCH(hits.page.pagePath,r'^/Page2($|/$|\?|/\?|%3F)'))
  or ( (REGEXP_MATCH(hits.page.pagePath,r'^/Page3($|/$|\?|/\?|%3F)'))
  and hits.transaction.transactionId is not null
  and hits.item.productSku is not null
  and hits.item.itemRevenue is not null )
  group each by [date]
  , [fullVisitorId]
  , [visitId]
  , [visitNumber]
  , [path_type]
  , [path]
  , [product]
  , [device.deviceCategory]
  )
group each by
[date]
, [path_type]
, [path]
, [product]
, [users]
, [path_users_2]
, [path_users_3]
, [device.deviceCategory]

)

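You need to construct a sequence of queries and work step by step toward your full path, using hits.time as the time sequence. Here is an example taken from the Streak blog post "Using Google BigQuery for Event Tracking".

We can create a subquery that identifies the visitHomepage events: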

(SELECT sessionId as sessionId1,
        timestamp as timestamp1
 FROM [events.log]
 WHERE name = "visitHomepage") AS step1

Then do the same for step2 and step3.

You can then combine these to get steps1_2:

(SELECT sessionId1,
        timestamp1,
        IF(timestamp1 < timestamp2, timestamp2, NULL) as timestamp2
 FROM
      (SELECT sessionId1,
              timestamp1,
              timestamp2
       FROM step1
       LEFT JOIN step2
       ON sessionId1 = sessionId2)
) AS steps1_2

Finally, join in step3 the same way to get the subquery we want:

(SELECT sessionId1 as sessionId,
        timestamp1 as visitHomepageTimestamp,
        timestamp2 as installExtensionTimestamp,
        IF(timestamp2 < timestamp3, timestamp3, NULL) as signInTimestamp
 FROM
      (SELECT sessionId1,
              timestamp1,
              timestamp2,
              timestamp3
       FROM steps1_2
       LEFT JOIN step3
       ON sessionId1 = sessionId3)
) AS steps1_2_3

Read the blog post linked above for a granular, step-by-step explanation of how to construct the query, and also check out the BigQuery Cookbook.
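As a rough illustration of what the chained LEFT JOINs above compute, here is a minimal Python sketch over an in-memory event list. The event tuples and the `funnel` helper are hypothetical stand-ins for the `events.log` table and are not taken from the blog post. It assumes at most one event per step per session, and it keeps a step's timestamp only when that step happened after the previous one.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical events: (sessionId, name, timestamp), standing in for [events.log].
Event = Tuple[str, str, int]


def funnel(events: List[Event], steps: List[str]) -> Dict[str, List[Optional[int]]]:
    """For each session that reached steps[0], return one timestamp per step.

    A later step keeps its timestamp only if it happened after the previous
    step; otherwise it is None, mirroring IF(t1 < t2, t2, NULL) in the SQL.
    Assumes at most one event per step per session.
    """
    # Index timestamps by (session, step name), like the step1/step2/step3 subqueries.
    seen: Dict[Tuple[str, str], int] = {}
    for session_id, name, ts in events:
        seen.setdefault((session_id, name), ts)

    result: Dict[str, List[Optional[int]]] = {}
    for (session_id, name), first_ts in seen.items():
        if name != steps[0]:
            continue
        row: List[Optional[int]] = [first_ts]
        for step in steps[1:]:
            prev = row[-1]
            ts = seen.get((session_id, step))  # LEFT JOIN: the step may be missing
            in_order = prev is not None and ts is not None and prev < ts
            row.append(ts if in_order else None)
        result[session_id] = row
    return result


events = [
    ("a", "visitHomepage", 10), ("a", "installExtension", 20), ("a", "signIn", 30),
    ("b", "visitHomepage", 10), ("b", "signIn", 15),
    ("c", "installExtension", 5), ("c", "visitHomepage", 10),
]
steps = ["visitHomepage", "installExtension", "signIn"]
completed = funnel(events, steps)
```

Note that this checks ordering only. On its own it does not require that the steps are adjacent, so other events may still sit between two steps.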

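Alternatively, you can order your query by hits.time to establish the order in which the user visited the pages, and use ROW_NUMBER or POSITION to attach ordinals to the hits, so that you can work further with that result set.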
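To make the ordinal idea concrete, here is a small Python sketch, not BigQuery code. The `(time, pagePath)` tuples and the `has_immediate_path` helper are hypothetical stand-ins for `hits.time` and `hits.page.pagePath`. Once every hit has an ordinal, an immediate-follow path is one whose pages sit at consecutive ordinals.

```python
from typing import List, Tuple

# Hypothetical hits for one session: (time, pagePath).
Hit = Tuple[int, str]


def has_immediate_path(hits: List[Hit], path: List[str]) -> bool:
    """True if the pages in `path` occur as consecutive hits, in order."""
    # Order by time and attach ordinals starting at 1, as ROW_NUMBER would.
    ordered = sorted(hits, key=lambda hit: hit[0])
    numbered = {position: page for position, (_, page) in enumerate(ordered, start=1)}
    # The path starts at ordinal `start` if ordinals start, start+1, ... hold its pages.
    for start in numbered:
        if all(numbered.get(start + offset) == page for offset, page in enumerate(path)):
            return True
    return False


target = ["/Page1", "/Page2", "/Page3"]
direct = [(9, "/Page3"), (0, "/Page1"), (5, "/Page2")]  # deliberately unsorted input
detour = [(0, "/Page1"), (3, "/Page456"), (5, "/Page2"), (9, "/Page3")]

direct_match = has_immediate_path(direct, target)
detour_match = has_immediate_path(detour, target)
```

Here `direct_match` is true and `detour_match` is false, because `/Page456` takes ordinal 2 in the second session.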

For your specific use case, I'm pretty sure you can do this faster by avoiding the JOIN and the GROUP BY.

Consider:

SELECT
  [date], fullVisitorId, visitId, visitNumber,
  GROUP_CONCAT(REGEXP_EXTRACT(hits.page.pagePath, '^(/[^/?]*)'), ">>")
    WITHIN RECORD AS Sequence
FROM
  (TABLE_DATE_RANGE
      ( [XXXXXX.ga_sessions_]
      , TIMESTAMP('2014-06-01')
      , TIMESTAMP('2014-06-05') )
  )
WHERE REGEXP_MATCH(hits.page.pagePath, r'^/Page[123]')
HAVING
  Sequence CONTAINS "/Page1>>/Page2>>/Page3";

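This takes advantage of scoped aggregation at the RECORD level to avoid a GROUP BY over individual sessions.

Also, individual records are atomic in BigQuery, and their repeated fields are processed in the order in which they were provided at import time. So, for GA session logs, the hits subrecords are concatenated WITHIN RECORD in sequence, after all other operations have been completed. Flattening out the hit timestamps and then joining them with a comparison is really just redoing that work.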
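The concatenate-then-match idea can be tried out in plain Python. The sketch below is an illustration only: the `sequence` and `contains_path` helpers and the sample sessions are hypothetical. It also shows two things worth checking against your own data. If hits are filtered down to the target pages before they are concatenated, an intervening page such as `/Page456` disappears and the detour looks like a direct path. And a plain substring test can match `/Page3` inside `/Page30` unless the ends are anchored on the separator. Whether either effect applies to the query above depends on how BigQuery applies the WHERE clause to the repeated field and on your page names, so treat this as something to verify.

```python
from typing import List

SEP = ">>"


def sequence(pages: List[str]) -> str:
    """Join a session's page paths in hit order, like GROUP_CONCAT ... WITHIN RECORD."""
    return SEP.join(pages)


def contains_path(pages: List[str], path: List[str]) -> bool:
    """Substring match anchored on separators, so '/Page3' cannot match '/Page30'."""
    needle = SEP + sequence(path) + SEP
    haystack = SEP + sequence(pages) + SEP
    return needle in haystack


target = ["/Page1", "/Page2", "/Page3"]
detour = ["/Page1", "/Page456", "/Page2", "/Page3"]

# Matching on the full hit sequence keeps the detour visible.
full_match = contains_path(detour, target)

# Filtering to the target pages first hides the detour.
filtered = [page for page in detour if page in target]
filtered_match = contains_path(filtered, target)

# An unanchored substring test would accept this session; the anchored one does not.
lookalike = ["/Page1", "/Page2", "/Page30"]
plain_match = sequence(target) in sequence(lookalike)
anchored_match = contains_path(lookalike, target)
```

With these inputs, `full_match` is false, `filtered_match` is true, `plain_match` is true, and `anchored_match` is false.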