How to join all keys correctly / How to identify join keys
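I'm having a problem with creating tables and joining them: joining multiple tables seems to give me duplicated results, or at least I think it does.
I have one big table that contains all the data. I also have several sub-tables (partitioned tables) that were created from the big table and each hold part of its information. I want queries against the sub-tables to return the same results as queries against the big table.
Here are the queries I used to create the 3 sub-tables.
First table - traffic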
create table `traffic` as
select
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId
from(
select
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
trafficSource.adContent,
trafficSource.campaign,
trafficSource.campaignCode,
trafficSource.keyword,
trafficSource.medium,
trafficSource.referralPath,
trafficSource.source,
channelGrouping,
trafficSource.adwordsClickInfo.campaignId,
h.customdimensions
from `ga_sessions_*`
left join unnest (hits) as h
WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117'
)
left join unnest (customdimensions) as c
group by
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
adContent,
campaign,
campaignCode,
keyword,
medium,
referralPath,
source,
channelGrouping,
campaignId
Second table - hits_page
create table `hits_page` as
select
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory
from(
select
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
h.page.pagePath,
h.page.pagePathLevel1,
h.page.pagePathLevel2,
h.page.pagePathLevel3,
h.page.pagePathLevel4,
h.page.hostname,
h.page.pageTitle,
h.page.searchKeyword,
h.page.searchCategory,
h.customdimensions
from `ga_sessions_*`
left join unnest (hits) as h
WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117'
)
left join unnest (customdimensions) as c
group by
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
pagePath,
pagePathLevel1,
pagePathLevel2,
pagePathLevel3,
pagePathLevel4,
hostname,
pageTitle,
searchKeyword,
searchCategory
Third table - hits
create table `hits` as
select
fullVisitorId,
visitId,
userId,
max(case when c.index = 10 then c.value else null end) as universal_id,
max(case when c.index = 8 then c.value else null end) as ps_project_id,
max(case when c.index = 39 then c.value else null end) as ps_campaign_id,
visitStartTime,
date,
visitNumber,
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer
from(
select
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
h.type,
h.hitNumber,
h.hour,
h.minute,
h.isEntrance,
h.isExit,
h.isInteraction,
h.time,
h.referer,
h.customdimensions
from `ga_sessions_*`
left join unnest (hits) as h
WHERE _TABLE_SUFFIX BETWEEN '20211117' and '20211117'
)
left join unnest (customdimensions) as c
group by
fullVisitorId,
visitId,
userId,
visitStartTime,
date,
visitNumber,
type,
hitNumber,
hour,
minute,
isEntrance,
isExit,
isInteraction,
time,
referer
Now, the query I'm using is as follows:
select
c.pagepath,
a.medium,
a.source,
count(*) as count_pageviews
from `traffic` as a
join `hits_page` as c
on a.fullVisitorId = c.fullVisitorId
and a.visitId = c.visitId
and a.visitStartTime = c.visitStartTime
and a.date = c.date
and a.visitNumber = c.visitNumber
join `hits` as d
on a.fullVisitorId = d.fullVisitorId
and a.visitId = d.visitId
and a.visitStartTime = d.visitStartTime
and a.date = d.date
and a.visitNumber = d.visitNumber
where pagepath = "/sellland"
and type = "PAGE"
group by 1,2,3
order by count_pageviews desc
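As you can see, I joined on all the columns that appear in all 3 sub-tables (fullVisitorId, visitId, visitStartTime, date, visitNumber), except userId, ps_campaign_id, universal_id and ps_project_id, which have no data and return no results when used in the join. The returned result is as follows: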
Row pagepath medium source count_pageviews
1 /sellland none direct 835
2 /sellland facebook_ad facebook 541
3 /sellland cpc google 390
4 /sellland referral lm.facebook.com 225
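I have since updated the way I create these 3 tables, and the result now looks closer. I have a feeling the problem is caused by the way I create the tables, so I'll focus on that now: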
Row pagepath medium source count_pageviews
1 /sellland facebook_ad facebook 388
2 /sellland cpc google 252
3 /sellland none direct 182
4 /sellland referral lm.facebook.com 83
However, it should return this:
Row pagepath medium source count_pageviews
1 /sellland facebook_ad facebook 357
2 /sellland cpc google 199
3 /sellland (none) (direct) 110
4 /sellland referral lm.facebook.com 48
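I can't find the error here. I'm not sure whether it comes from how I unnest, from joining on all the columns that appear in all 3 tables, or from some other column. Thanks in advance for any input.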
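One way to check whether the session-level keys cause a fan-out is to count rows per key in a sub-table. The query below is a minimal diagnostic sketch against the `hits` table defined above; the alias rows_per_session and the limit of 20 are only illustrative. The same check can be run against `hits_page`.

```sql
-- List sessions that have more than one row in `hits`.
-- Any row returned here means (fullVisitorId, visitId) does not
-- identify a single row, so a join on those keys alone multiplies rows.
select
  fullVisitorId,
  visitId,
  count(*) as rows_per_session
from `hits`
group by fullVisitorId, visitId
having count(*) > 1
order by rows_per_session desc
limit 20
```

If a session has m rows in `hits_page` and n rows in `hits`, joining only on session-level keys returns m × n rows for each matching `traffic` row, which would inflate count(*) in the way the first result set suggests.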
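OK guys, it looks like I found the answer. I'm not sure it is the definitive answer, but it has worked for me so far; feel free to correct me if I'm wrong.
I believe the problem was that I was not joining on all of the join keys, and that is what duplicated the data. I was querying hit-level data without joining on a hit-level key. So I added hit.hitNumber as a join key, which lines the hit-level rows up with each other.
For anyone not familiar with this: at the hit level the data is stored in arrays (data within data). Each session row holds more than one hit, so the data has to be unnested into one row per hit before it can be used. By unnesting and joining on hit.hitNumber, I got numbers that match Google Analytics.
Here is the code that made it work: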
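-- Note: this assumes the rebuilt sub-tables each include hitNumber;
-- as defined above, traffic and hits_page do not select it.
select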
c.pagepath,
a.medium,
a.source,
count(*) as count_pageviews
from `traffic` as a
join `hits_page` as c
on a.fullVisitorId = c.fullVisitorId
and a.visitId = c.visitId
and a.visitStartTime = c.visitStartTime
and a.date = c.date
and a.visitNumber = c.visitNumber
and a.hitNumber = c.hitNumber
join `hits` as d
on a.fullVisitorId = d.fullVisitorId
and a.visitId = d.visitId
and a.visitStartTime = d.visitStartTime
and a.date = d.date
and a.visitNumber = d.visitNumber
and a.hitNumber = d.hitNumber
where c.pagepath = "/sellland"
and d.type = "PAGE"
group by 1,2,3
order by count_pageviews desc
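As a cross-check, the same count can be computed directly from the export table with a single unnest, so that no join keys are involved. This is only a sketch: it assumes the same `ga_sessions_*` table and date suffix used to build the sub-tables, and the column aliases are chosen to match the query above.

```sql
-- Reference count taken straight from the export table.
-- UNNEST(hits) yields one row per hit, so count(*) counts hits.
select
  h.page.pagePath as pagepath,
  trafficSource.medium as medium,
  trafficSource.source as source,
  count(*) as count_pageviews
from `ga_sessions_*`, unnest(hits) as h
where _TABLE_SUFFIX between '20211117' and '20211117'
  and h.page.pagePath = '/sellland'
  and h.type = 'PAGE'
group by 1, 2, 3
order by count_pageviews desc
```

If the sub-tables are joined on a key that is unique per hit, the joined query should return the same counts as this one; a difference points to remaining duplication in the tables or the join.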