How to join all keys correctly / How to identify join keys

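I'm running into a problem when creating tables and joining them, or at least I think I am: joining multiple tables seems to produce duplicated results.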

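I have one big table that contains all the data. I also have several sub-tables (partitioned tables) that were created from the big table, each holding part of its information. I expect a query against the sub-tables to return the same result as the equivalent query against the big table.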

Here are the queries I used to create the 3 sub-tables. First table - traffic

create table `traffic` as

select 
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId

from(

select 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
trafficSource.adContent,
trafficSource.campaign,
trafficSource.campaignCode,
trafficSource.keyword,
trafficSource.medium,
trafficSource.referralPath,
trafficSource.source,
channelGrouping,
trafficSource.adwordsClickInfo.campaignId,
h.customdimensions

from `ga_sessions_*`

left join unnest (hits) as h

WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' 
)
left join unnest (customdimensions) as c

group by 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId 

Second table - hits_page

create table `hits_page` as


select 
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory

from(

select 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
h.page.pagePath,
h.page.pagePathLevel1,
h.page.pagePathLevel2,
h.page.pagePathLevel3,
h.page.pagePathLevel4,
h.page.hostname,
h.page.pageTitle,
h.page.searchKeyword,
h.page.searchCategory,
h.customdimensions

from `ga_sessions_*`

left join unnest (hits) as h

WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' 
)
left join unnest (customdimensions) as c

group by 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory

Third table - hits

create table `hits` as

select 
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer

from(

select 
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
h.type,
h.hitNumber,
h.hour,
h.minute,
h.isEntrance,
h.isExit,
h.isInteraction,
h.time,
h.referer,
h.customdimensions

from `ga_sessions_*`

left join unnest (hits) as h

WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117' 
)
left join unnest (customdimensions) as c

group by 
fullVisitorId, 
visitId, 
userId, 
visitStartTime, 
date, 
visitNumber, 
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer

Now, the query I'm using is as follows:

select 
    c.pagepath,
    a.medium,
    a.source,
    count(*) as count_pageviews
 
from `traffic` as a

 join `hits_page` as c
 on a.fullVisitorId = c.fullVisitorId 
 and a.visitId = c.visitId 
 and a.visitStartTime = c.visitStartTime 
 and a.date = c.date
 and a.visitNumber = c.visitNumber

 join `hits` as d
 on a.fullVisitorId = d.fullVisitorId 
 and a.visitId = d.visitId 
 and a.visitStartTime = d.visitStartTime 
 and a.date = d.date
 and a.visitNumber = d.visitNumber
 
where pagepath = "/sellland"
and type = "PAGE"
 
group by 1,2,3
 
order by count_pageviews desc

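As you can see, I joined on all the columns that appear in all 3 sub-tables: fullVisitorId, visitId, visitStartTime, date, and visitNumber. The exceptions are userId, ps_campaign_id, universal_id, and ps_project_id, which contain no data, so joining on them returned no results. The query returns the following: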

Row  pagepath   medium       source           count_pageviews
1    /sellland  none         direct           835
2    /sellland  facebook_ad  facebook         541
3    /sellland  cpc          google           390
4    /sellland  referral     lm.facebook.com  225

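I have since updated the way I create these 3 tables, and the result now looks closer. I have a feeling the problem comes from how the tables are created, so I will focus on that now: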

Row  pagepath   medium       source           count_pageviews
1    /sellland  facebook_ad  facebook         388
2    /sellland  cpc          google           252
3    /sellland  none         direct           182
4    /sellland  referral     lm.facebook.com  83

However, it should return something like this:

Row  pagepath   medium       source           count_pageviews
1    /sellland  facebook_ad  facebook         357
2    /sellland  cpc          google           199
3    /sellland  (none)       (direct)         110
4    /sellland  referral     lm.facebook.com  48

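I can't find the mistake here. I'm not sure whether it comes from how I unnest, from joining on all the columns that appear in all 3 tables, or from something else. Thanks in advance for any input.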
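One way to identify a safe join key is to check how many rows share the same value of the candidate key in each table. The sketch below is only an illustration: it uses Python's built-in sqlite3 module with a small made-up `hits_page` table (the rows and the helper `max_rows_per_key` are hypothetical, not taken from the real export), but the same `group by` check can be written in BigQuery.

```python
import sqlite3

# Toy stand-in for the hits_page sub-table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table hits_page "
    "(fullVisitorId text, visitId integer, hitNumber integer, pagePath text)"
)
conn.executemany(
    "insert into hits_page values (?, ?, ?, ?)",
    [
        ("u1", 100, 1, "/home"),
        ("u1", 100, 2, "/sellland"),
        ("u1", 100, 3, "/sellland"),
        ("u2", 200, 1, "/sellland"),
    ],
)


def max_rows_per_key(table, key_columns):
    """Return the largest number of rows that share one value of the candidate key."""
    cols = ", ".join(key_columns)
    sql = f"select max(n) from (select count(*) as n from {table} group by {cols})"
    return conn.execute(sql).fetchone()[0]


# Session-level key: several rows per key value, so a join on it can multiply rows.
session_key_max = max_rows_per_key("hits_page", ["fullVisitorId", "visitId"])
# Hit-level key: exactly one row per key value, so a join on it matches one-to-one.
hit_key_max = max_rows_per_key(
    "hits_page", ["fullVisitorId", "visitId", "hitNumber"]
)

print(session_key_max)  # 3
print(hit_key_max)      # 1
```

If the maximum is greater than 1 in more than one of the joined tables, the join will likely multiply rows, which would be consistent with the inflated counts above.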

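OK guys, it looks like I found the answer. I'm not sure it is the definitive answer, but it has worked for me so far, so feel free to correct me if I'm wrong.

I believe the problem was that I did not join on all the join keys, which caused the data to be duplicated. I was querying hit-level data without joining on a hit-level key. So I added hits.hitNumber as a join key, which connects the hit-level rows one-to-one.

For anyone not familiar with this: at the hit level the data is stored in arrays (data within data), so each session row holds more than one hit, and the data has to be unnested before it can be used. By unnesting and joining that data on hits.hitNumber, I was able to get numbers that match Google Analytics.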
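The effect described above can be reproduced on a toy example. This is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery, with invented data: one session (visitId 100) containing three hits, of which only hit 2 is a PAGE hit on /sellland. It assumes, as the fixed query does, that every sub-table carries a hitNumber column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table traffic   (visitId integer, hitNumber integer, source text);
create table hits_page (visitId integer, hitNumber integer, pagePath text);
create table hits      (visitId integer, hitNumber integer, type text);

-- One session with three hits; only hit 2 is a PAGE hit on /sellland.
insert into traffic   values (100, 1, 'google'), (100, 2, 'google'), (100, 3, 'google');
insert into hits_page values (100, 1, '/home'), (100, 2, '/sellland'), (100, 3, '/sellland');
insert into hits      values (100, 1, 'PAGE'), (100, 2, 'PAGE'), (100, 3, 'EVENT');
""")

# Join on the session-level key only: every row of the session in one table
# is paired with every row of the same session in the other tables.
session_join = """
select count(*) from traffic a
join hits_page c on a.visitId = c.visitId
join hits d on a.visitId = d.visitId
where c.pagePath = '/sellland' and d.type = 'PAGE'
"""

# Join on the session-level key plus hitNumber: each hit matches only itself.
hit_join = """
select count(*) from traffic a
join hits_page c on a.visitId = c.visitId and a.hitNumber = c.hitNumber
join hits d on a.visitId = d.visitId and a.hitNumber = d.hitNumber
where c.pagePath = '/sellland' and d.type = 'PAGE'
"""

session_count = conn.execute(session_join).fetchone()[0]
hit_count = conn.execute(hit_join).fetchone()[0]

print(session_count)  # 12 = 3 traffic rows x 2 /sellland rows x 2 PAGE rows
print(hit_count)      # 1, the single real pageview of /sellland
```

The session-level join counts every combination of rows within the session, while the hit-level join counts each hit once.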

After unnesting, each hit becomes its own row.
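To make the flattening concrete, here is a small pure-Python sketch that imitates what `left join unnest(hits) as h` does to nested session rows. The sample sessions and the helper name `unnest_hits` are hypothetical and only meant to show the shape of the result, not the real export schema.

```python
# Nested input: one row per session, with an array of hits inside each row.
sessions = [
    {
        "visitId": 100,
        "source": "google",
        "hits": [
            {"hitNumber": 1, "type": "PAGE", "pagePath": "/home"},
            {"hitNumber": 2, "type": "PAGE", "pagePath": "/sellland"},
        ],
    },
    {"visitId": 200, "source": "facebook", "hits": []},
]


def unnest_hits(rows):
    """Imitate `left join unnest(hits) as h`: emit one row per hit and keep
    sessions without hits as a single row whose hit fields are None."""
    empty_hit = {"hitNumber": None, "type": None, "pagePath": None}
    flat = []
    for row in rows:
        session_fields = {k: v for k, v in row.items() if k != "hits"}
        for hit in row["hits"] or [empty_hit]:
            # Session-level fields are repeated on every hit row.
            flat.append({**session_fields, **hit})
    return flat


flat_rows = unnest_hits(sessions)
for r in flat_rows:
    print(r)
```

Because the session-level fields are repeated on every flattened row, they no longer identify a row on their own, which is why a hit-level field such as hitNumber has to be part of the join key.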

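Here is the code that made it work (it relies on hitNumber having been selected into all three sub-tables when they were created):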

select
   c.pagepath,
   a.medium,
   a.source,
   count(*) as count_pageviews
 
from `traffic` as a
 
join `hits_page` as c
on a.fullVisitorId = c.fullVisitorId
and a.visitId = c.visitId
and a.visitStartTime = c.visitStartTime
and a.date = c.date
and a.visitNumber = c.visitNumber
and a.hitNumber = c.hitNumber
 
join `hits` as d
on a.fullVisitorId = d.fullVisitorId
and a.visitId = d.visitId
and a.visitStartTime = d.visitStartTime
and a.date = d.date
and a.visitNumber = d.visitNumber
and a.hitNumber = d.hitNumber
 
where  c.pagepath = "/sellland"
and d.type = "PAGE"
 
group by 1,2,3
 
order by count_pageviews desc

Reference: https://towardsdatascience.com/explore-arrays-and-structs-for-better-performance-in-google-bigquery-8978fb00a5bc