Create New Columns in a New Table Based on Values in Another Table, with Event Attribution Modeling in Mind

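I'm working on creating a table to help my company with attribution modeling. We have several datasets, including invoice, company, person, and event data.

Our data is complicated because we deal with B2B (business-to-business) customers. So it isn't as simple as looking at one person's events and attributing the invoice total to the events (or actions) they took.

Instead, our invoices reference a company ID and our people reference a company ID; our people in turn reference their events. So I'm currently joining my tables based on these relationships, and I have one giant table with all of the information.

It looks like this: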

INVOICE_ID | INVOICE_DATE | INVOICE_TOTAL | PERSON_COMPANY_ID | PERSON_EMAIL | EVENT_NAME | EVENT_DATE | DAYS_BETWEEN_EVENT_AND_INVOICE
111 | 3/7/2022 | 4.80 | ABC | john@coolcompany.com | Spoke to Sales Rep | 2/10/2022 | 25
111 | 3/7/2022 | 4.80 | ABC | jenny@coolcompany.com | Form Submitted | 6/8/2021 | 272
111 | 3/7/2022 | 4.80 | ABC | jenny@coolcompany.com | Spoke to Sales Rep | 2/10/2022 | 25
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Clicked Email | 3/21/2022 | -14
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Chat on Website | 3/2/2022 | 5
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Opened Email | 3/7/2022 | 0
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Spoke to Sales Rep | 2/10/2022 | 25
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Google Ad | 2/28/2022 | 7
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Google Ad | 3/1/2022 | 6
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Google Ad | 3/2/2022 | 5
111 | 3/7/2022 | 4.80 | ABC | jim@coolcompany.com | Google Ad | 3/14/2022 | -7
111 | 3/7/2022 | 4.80 | ABC | mark@coolcompany.com | Spoke to Sales Rep | 2/10/2022 | 25
111 | 3/7/2022 | 4.80 | ABC | mark@coolcompany.com | Form Submitted | 12/2/2021 | 95
222 | 3/7/2022 | 4.80 | XYZ | tom@coolcompany.com | Spoke to Sales Rep | 2/10/2022 | 25
222 | 3/7/2022 | 0.25 | XYZ | andy@testcompany.com | Spoke to Sales Rep | 6/3/2021 | 277
222 | 3/7/2022 | 0.25 | XYZ | andy@testcompany.com | Spoke to Sales Rep | 4/8/2021 | 333
222 | 3/7/2022 | 0.25 | XYZ | andy@testcompany.com | Spoke to Sales Rep | 6/4/2021 | 276
222 | 3/7/2022 | 0.25 | XYZ | andy@testcompany.com | Spoke to Sales Rep | 2/23/2022 | 12
222 | 3/7/2022 | 0.25 | XYZ | phil@testcompany.com | Spoke to Sales Rep | 2/23/2022 | 12
222 | 3/7/2022 | 0.25 | XYZ | jordan@testcompany.com | Spoke to Sales Rep | 4/8/2021 | 333
222 | 3/7/2022 | 0.25 | XYZ | jordan@testcompany.com | Spoke to Sales Rep | 6/4/2021 | 276
222 | 3/7/2022 | 0.25 | XYZ | jordan@testcompany.com | Spoke to Sales Rep | 2/23/2022 | 12
222 | 3/7/2022 | 0.25 | XYZ | matt@testcompany.com | Spoke to Sales Rep | 2/23/2022 | 12

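I'd like to create a table with event-position columns based on the last five events that occurred before each invoice, limited to events within the 90 days leading up to the invoice date. So the new table might look something like this: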

INVOICE_ID | INVOICE_DATE | INVOICE_TOTAL | PERSON_COMPANY_ID | EVENT_5 | EVENT_5_EMAIL | EVENT_5_DATE | EVENT_4 | EVENT_4_EMAIL | EVENT_4_DATE | EVENT_3 | EVENT_3_EMAIL | EVENT_3_DATE | EVENT_2 | EVENT_2_EMAIL | EVENT_2_DATE | EVENT_1 | EVENT_1_EMAIL | EVENT_1_DATE
111 | 3/7/2022 | 4.80 | ABC | Google Ad | jim@coolcompany.com | 2/28/2022 | Google Ad | jim@coolcompany.com | 3/1/2022 | Google Ad | jim@coolcompany.com | 3/2/2022 | Chat on Website | jim@coolcompany.com | 3/2/2022 | Opened Email | jim@coolcompany.com | 3/7/2022
222 | 3/7/2022 | 0.25 | XYZ | Spoke to Sales Rep | nick@testcompany.com | 2/23/2022 | Spoke to Sales Rep | matt@testcompany.com | 2/23/2022 | Spoke to Sales Rep | jordan@testcompany.com | 2/23/2022 | Spoke to Sales Rep | phil@testcompany.com | 2/23/2022 | Spoke to Sales Rep | andy@testcompany.com | 2/23/2022

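To try to create this, I added the DAYS_BETWEEN_EVENT_AND_INVOICE column, as you can see in the first table. I figured that using it to filter out negative values would get me closer, but I'm not sure this is the best approach for attribution. I'm also not sure how to essentially loop through my table and fill in my second table based on these criteria: the last 5 events for an invoice, going back only 90 days.

I'm using SQL, a Snowflake data warehouse, and ultimately Power BI to visualize this data.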

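You can do this in Power Query (=> Transform).

In the data produced by this query, there may be an error in the invoice total for invoice 222. This may be due to a typo in the source data, where one event row for that invoice carries the same invoice total as invoice 111.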

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("tdRPb4IwFADwr9KQHR1tXxHYTWe2HZfMw/4YD1WaWaEtge7gt1/VgNnEBAUupXmkv/foK10sPEqpN/IYjjAQADe9G5PAj4mbTR9nbtyajZ6sjcnWRuVc73z3dOF5blKBrEFznokSvYncBQFTUjkw9pajNr7QeteQ4NkUCs1/VkpaKxIXCHG8N/YcRNAN7696qRr4WSbXqUjQk+IyOwBAK+GeBp3oDbfIaPQuVqW04ohXRKeiX3Oh/9Rcrydd2IG3+sWY70ygaXJQIa5WR32hDNetC/sz+2nZvzqD+oy1/HrFi3TIll3wz35tCscN2XMPRxsOYKP98fnlRmu6nLZzngLx3cuK5zrZTawobQs/xOx0M0X9+8Hp5mOMDVF/cKo/7N93vWWVQ6Gdn29kNqS/NUXC9ZAduC7DLT24LsMtu6S4tbf6y18=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [INVOICE_ID = _t, INVOICE_DATE = _t, INVOICE_TOTAL = _t, PERSON_COMPANY_ID = _t, PERSON_EMAIL = _t, EVENT_NAME = _t, EVENT_DATE = _t, DAYS_BETWEEN_EVENT_AND_INVOICE = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"INVOICE_ID", Int64.Type}, {"INVOICE_DATE", type date}, {"INVOICE_TOTAL", Currency.Type}, {"PERSON_COMPANY_ID", type text}, {"PERSON_EMAIL", type text}, {"EVENT_NAME", type text}, {"EVENT_DATE", type date}, {"DAYS_BETWEEN_EVENT_AND_INVOICE", Int64.Type}}),

//removed this column since we won't need it
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type",{"DAYS_BETWEEN_EVENT_AND_INVOICE"}),
    
//Group by Invoice
    #"Grouped Rows" = Table.Group(#"Removed Columns", {"INVOICE_ID"}, {
        {"within90", (t)=> let 

        //Filter the table by duration between invoice date and event date
        //then sort descending by event date and split off the first five rows
        //   note that split will be populated by fewer rows if there are not five dates in the range
            x = Table.Split(
                    Table.Sort(
                        Table.SelectRows(t, 
                            each Duration.Days([INVOICE_DATE]-[EVENT_DATE]) < 90 and 
                                Duration.Days([INVOICE_DATE]-[EVENT_DATE]) >= 0),
                    {"EVENT_DATE", Order.Descending}), 
                5){0}, 

        //generate a list of records, along with their field names, for those events
            events = List.Generate(()=>
                [evEM=x{0}[PERSON_EMAIL] , evN=x{0}[EVENT_NAME], evD=x{0}[EVENT_DATE] , idx=0],
                each [idx] < Table.RowCount(x),
                each  [evEM=x{[idx]+1}[PERSON_EMAIL] , evN=x{[idx]+1}[EVENT_NAME], evD=x{[idx]+1}[EVENT_DATE] , idx=[idx]+1],
                each Record.FromList( 
                    {[evN],[evEM],[evD]},
                        {"EVENT_" & Text.From([idx]+1), 
                         "EVENT_" & Text.From([idx]+1) & " EMAIL", 
                         "EVENT_" & Text.From([idx]+1) & " DATE"})),

        //combine the generated records with the first row of each subTable to create new table rows
            newTable = Record.Combine({t{0}} & List.Reverse(events))        
            
        in 
            newTable}
        }),

//expand the records to new columns and set the data types
    #"Expanded within90" = Table.ExpandRecordColumn(#"Grouped Rows", "within90", {"INVOICE_DATE", "INVOICE_TOTAL", "PERSON_COMPANY_ID", "PERSON_EMAIL", "EVENT_NAME", "EVENT_DATE", "EVENT_5", "EVENT_5 EMAIL", "EVENT_5 DATE", "EVENT_4", "EVENT_4 EMAIL", "EVENT_4 DATE", "EVENT_3", "EVENT_3 EMAIL", "EVENT_3 DATE", "EVENT_2", "EVENT_2 EMAIL", "EVENT_2 DATE", "EVENT_1", "EVENT_1 EMAIL", "EVENT_1 DATE"}, {"INVOICE_DATE", "INVOICE_TOTAL", "PERSON_COMPANY_ID", "PERSON_EMAIL", "EVENT_NAME", "EVENT_DATE", "EVENT_5", "EVENT_5 EMAIL", "EVENT_5 DATE", "EVENT_4", "EVENT_4 EMAIL", "EVENT_4 DATE", "EVENT_3", "EVENT_3 EMAIL", "EVENT_3 DATE", "EVENT_2", "EVENT_2 EMAIL", "EVENT_2 DATE", "EVENT_1", "EVENT_1 EMAIL", "EVENT_1 DATE"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Expanded within90",{{"INVOICE_DATE", type date}, {"INVOICE_TOTAL", type number}, {"PERSON_COMPANY_ID", type text}, {"PERSON_EMAIL", type text}, {"EVENT_NAME", type text}, {"EVENT_DATE", type date}, {"EVENT_5", type text}, {"EVENT_5 EMAIL", type text}, {"EVENT_5 DATE", type date}, {"EVENT_4", type text}, {"EVENT_4 EMAIL", type text}, {"EVENT_4 DATE", type date}, {"EVENT_3", type text}, {"EVENT_3 EMAIL", type text}, {"EVENT_3 DATE", type date}, {"EVENT_2", type text}, {"EVENT_2 EMAIL", type text}, {"EVENT_2 DATE", type date}, {"EVENT_1", type text}, {"EVENT_1 EMAIL", type text}, {"EVENT_1 DATE", type date}})
in
    #"Changed Type1"
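
For readers who don't work in M, the per-invoice step inside the grouping (filter to the 90-day window, sort newest first, keep five rows, then spread them into numbered columns) can be sketched in Python. This is only an illustration of the logic, not a replacement for the query: the function names `last_events` and `widen` are hypothetical, and the rows are a subset of the question's sample data for invoice 111.

```python
from datetime import date

INVOICE_DATE = date(2022, 3, 7)

# (person_email, event_name, event_date) rows for invoice 111, taken from the question
rows = [
    ("john@coolcompany.com", "Spoke to Sales Rep", date(2022, 2, 10)),
    ("jenny@coolcompany.com", "Form Submitted", date(2021, 6, 8)),
    ("jim@coolcompany.com", "Clicked Email", date(2022, 3, 21)),
    ("jim@coolcompany.com", "Chat on Website", date(2022, 3, 2)),
    ("jim@coolcompany.com", "Opened Email", date(2022, 3, 7)),
    ("jim@coolcompany.com", "Google Ad", date(2022, 2, 28)),
    ("jim@coolcompany.com", "Google Ad", date(2022, 3, 1)),
    ("jim@coolcompany.com", "Google Ad", date(2022, 3, 2)),
    ("jim@coolcompany.com", "Google Ad", date(2022, 3, 14)),
]


def last_events(rows, invoice_date, window_days=90, limit=5):
    """Hypothetical helper: the most recent `limit` events inside the window."""
    # Keep events on or before the invoice date and less than window_days old,
    # mirroring the Table.SelectRows condition (0 <= days < 90).
    eligible = [r for r in rows if 0 <= (invoice_date - r[2]).days < window_days]
    # Most recent first; the sort is stable, so same-day events keep input order.
    eligible.sort(key=lambda r: r[2], reverse=True)
    return eligible[:limit]


def widen(events):
    """Hypothetical helper: spread ranked events into EVENT_n columns."""
    out = {}
    for i, (email, name, when) in enumerate(events, start=1):
        # EVENT_1 is the most recent event, as in the M query.
        out[f"EVENT_{i}"] = name
        out[f"EVENT_{i} EMAIL"] = email
        out[f"EVENT_{i} DATE"] = when
    return out


wide_row = widen(last_events(rows, INVOICE_DATE))
```

With these rows, `wide_row["EVENT_1"]` is "Opened Email" and `wide_row["EVENT_5"]` is the Google Ad event from 2/28/2022, which lines up with the desired output in the question. When fewer than five events fall inside the window, the later keys are simply absent.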

Try a solution using CTEs and pivot.

with cte1 as (
  select * from
  (
    select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID,
           event_name, 'event_'||rn event1
    from (
      select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID,
             event_name, dd, rn
      from (
        select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID,
               event_name, datediff(day, event_date, invoice_date) dd,
               -- rn = 1 is the most recent event; the extra sort keys make ties
               -- rank the same way in all three CTEs
               row_number() over (partition by invoice_id
                                  order by dd, event_name, person_email) as rn
        from invoice1
        -- keep only events from the 90 days up to and including the invoice date
        where dd between 0 and 90
      )
      where rn <= 5
    ) x
  )
  pivot (max(event_name)
         for event1 in ('event_1','event_2','event_3','event_4','event_5')) as pvt
),
cte2 as (
  select * from
  (
    select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, PERSON_EMAIL,
           'event_'||rn||'_email' event1
    from (
      select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, PERSON_EMAIL,
             dd, rn
      from (
        select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, PERSON_EMAIL,
               datediff(day, event_date, invoice_date) dd,
               row_number() over (partition by invoice_id
                                  order by dd, event_name, person_email) as rn
        from invoice1
        where dd between 0 and 90
      )
      where rn <= 5
    ) x
  )
  pivot (max(PERSON_EMAIL)
         for event1 in ('event_1_email','event_2_email','event_3_email','event_4_email','event_5_email')) as pvt
),
cte3 as (
  select * from
  (
    select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, EVENT_DATE,
           'event_'||rn||'_date' event1
    from (
      select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, EVENT_DATE,
             dd, rn
      from (
        select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, EVENT_DATE,
               datediff(day, event_date, invoice_date) dd,
               row_number() over (partition by invoice_id
                                  order by dd, event_name, person_email) as rn
        from invoice1
        where dd between 0 and 90
      )
      where rn <= 5
    ) x
  )
  pivot (max(EVENT_DATE)
         for event1 in ('event_1_date','event_2_date','event_3_date','event_4_date','event_5_date')) as pvt
)
select
  cte1.invoice_id, cte1.invoice_date, cte1.invoice_total, cte1.person_company_id,
  cte1."'event_1'", cte2."'event_1_email'", cte3."'event_1_date'",
  cte1."'event_2'", cte2."'event_2_email'", cte3."'event_2_date'",
  cte1."'event_3'", cte2."'event_3_email'", cte3."'event_3_date'",
  cte1."'event_4'", cte2."'event_4_email'", cte3."'event_4_date'",
  cte1."'event_5'", cte2."'event_5_email'", cte3."'event_5_date'"
from cte1
join cte2 on cte1.invoice_id = cte2.invoice_id
join cte3 on cte2.invoice_id = cte3.invoice_id;

The main query (inside each CTE) is shown below. It ranks each invoice's events from most recent (rn = 1) to oldest, keeping only events from the 90 days up to the invoice date -

select INVOICE_ID, INVOICE_DATE, INVOICE_TOTAL, PERSON_COMPANY_ID, EVENT_DATE, datediff(day, event_date, invoice_date) dd,
row_number() over (partition by invoice_id order by dd, event_name, person_email) as rn
from invoice1 where dd between 0 and 90

The table is assumed to be defined as follows -

create table invoice1
(
INVOICE_ID number,
INVOICE_DATE date,
INVOICE_TOTAL varchar(100),
PERSON_COMPANY_ID varchar(100),
PERSON_EMAIL varchar(100),
EVENT_NAME varchar(100),
EVENT_DATE date
);
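
The three pivots above each repeat the ranking step. As a possible alternative, the same wide row can be built from a single ranking pass followed by conditional aggregation (`max(case when rn = n then ... end)`). The sketch below runs that idea through Python's built-in `sqlite3` module so it can be tried without a warehouse. Several things here are assumptions made for illustration: the table name `invoice_events` is hypothetical, SQLite (3.25 or later, for window functions) stands in for Snowflake, the day difference is stored as a precomputed `days_between` column like the question's DAYS_BETWEEN_EVENT_AND_INVOICE rather than computed with `datediff`, and only a few of the fifteen output columns are selected to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table invoice_events ("
    " invoice_id integer, person_email text, event_name text,"
    " event_date text, days_between integer)"
)
# Rows for invoice 111 from the question; days_between = invoice date - event date.
conn.executemany(
    "insert into invoice_events values (?, ?, ?, ?, ?)",
    [
        (111, "john@coolcompany.com", "Spoke to Sales Rep", "2022-02-10", 25),
        (111, "jenny@coolcompany.com", "Form Submitted", "2021-06-08", 272),
        (111, "jim@coolcompany.com", "Clicked Email", "2022-03-21", -14),
        (111, "jim@coolcompany.com", "Chat on Website", "2022-03-02", 5),
        (111, "jim@coolcompany.com", "Opened Email", "2022-03-07", 0),
        (111, "jim@coolcompany.com", "Google Ad", "2022-02-28", 7),
        (111, "jim@coolcompany.com", "Google Ad", "2022-03-01", 6),
        (111, "jim@coolcompany.com", "Google Ad", "2022-03-02", 5),
        (111, "jim@coolcompany.com", "Google Ad", "2022-03-14", -7),
    ],
)

query = """
with ranked as (
    select invoice_id, person_email, event_name, event_date,
           -- rank once: smallest day difference (most recent event) first
           row_number() over (
               partition by invoice_id
               order by days_between, event_name, person_email
           ) as rn
    from invoice_events
    where days_between between 0 and 90
)
select invoice_id,
       max(case when rn = 1 then event_name end)   as event_1,
       max(case when rn = 1 then person_email end) as event_1_email,
       max(case when rn = 1 then event_date end)   as event_1_date,
       max(case when rn = 2 then event_name end)   as event_2,
       max(case when rn = 5 then event_name end)   as event_5,
       max(case when rn = 5 then event_date end)   as event_5_date
from ranked
where rn <= 5
group by invoice_id
"""
result = conn.execute(query).fetchall()
```

Because every output column is derived from the same `rn`, the name, email, and date in each EVENT_n group always come from the same source row. The remaining columns would follow the same `max(case when rn = n ...)` pattern.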