How to iterate over all records to network rows from shared events?
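I am implementing a user-network system in which I take records from a log file that simply lists each user and the session they logged.
Here is an example of the format:
SessionNumber | UserID
10000 | A0000001
10001 | B3460009
... | ...
Each of these sessions can (and often does) have multiple users.
The aim is to assign each user a Network ID. The rules for the Network ID are:
If the users in a session have no Network ID assigned, assign them a new Network ID (the smallest unique positive integer not yet assigned to a network).
If a user in the session already belongs to a network, all users in that session are assigned that Network ID.
If multiple users in a session already belong to networks, take the smallest Network ID and assign it to all users in that session and to all users in the associated networks.
For example, a session occurs with users A, B and C. The table below shows their Network IDs before the session:
User | Network ID
A | -
B | 6
C | 8
In addition, the users in network 8 are:
User | Network ID
D | 8
E | 8
The expected result of the code is:
User | Network ID
A | 6
B | 6
C | 6
D | 6
E | 6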
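To make the three rules concrete, here is a minimal sketch in Python (not the R or SQL from this post). The function name `merge_session` and the dictionary layout (user to Network ID) are assumptions made for illustration only.

```python
def merge_session(network_of, session_users):
    """Apply the three rules to one session.

    network_of maps user -> Network ID and is updated in place.
    session_users lists the users seen together in one session.
    """
    # Network IDs already held by users of this session.
    existing = {network_of[u] for u in session_users if u in network_of}

    if existing:
        # Rules 2 and 3: reuse the smallest Network ID found in the session.
        target = min(existing)
    else:
        # Rule 1: smallest positive integer not yet used by any network.
        used = set(network_of.values())
        target = 1
        while target in used:
            target += 1

    # Rule 3: everyone in an associated network moves to the target ID.
    for user, net in list(network_of.items()):
        if net in existing:
            network_of[user] = target

    # All users of the session end up in the target network.
    for user in session_users:
        network_of[user] = target
    return network_of


# The worked example: A is new, B is in network 6, C/D/E are in network 8.
networks = {"B": 6, "C": 8, "D": 8, "E": 8}
merge_session(networks, ["A", "B", "C"])
print(sorted(networks.items()))
```

Run on the example above, every user from A to E ends up in network 6, which matches the expected result table.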
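I have already developed code in RStudio that does this by going through all sessions in the new log file sequentially, checking networks and assigning the appropriate values.
The problem is that this logic ideally needs to be deployed in a SQL environment: Teradata, via Teradata SQL Assistant only.
I don't have much experience writing logic or for loops in SQL; I have only really done bulk queries.
Is this feasible in SQL alone? If so, can you suggest resources or a direction for achieving it?
Thanks!
As mentioned, I have coded this in RStudio, but it needs temporary holder variables and is not efficiently coded, and I don't know whether it translates directly to SQL.
From my research, cursors seem to be the way to go, but nearly all the resources I found are for MySQL or SQL Server rather than Teradata.
Edit: below is my RStudio code, which implements the network assignment.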
library(dplyr) #provides %>%, select, filter and distinct
Check_List <- Sessions_n_long_Multi %>% select(User_ID) %>% distinct() #get a list of all Users listed distinctly
Check_List$Network_ID <- 0 #create new column which stores Network_ID (which User-Network a User belongs to)
Next_ID <- 0 #Initialise Network_ID counter
for (row_count in seq_len(nrow(Sessions_n_long_Multi))){ #for all sessions in the long format session-user attribution table
  snl_holder_session <- Sessions_n_long_Multi[row_count,1] #get the session number for the current target
  snl_holder_pid <- Sessions_n_long_Multi[row_count,2] #get the User_ID for the current target
  CL_match <- Check_List %>% filter(User_ID == snl_holder_pid) #get the CheckList data which matches the target User_ID
  if(CL_match[1,2]==0){ #if the target user has no Network_ID then..
    matched_sessions_user <- Sessions_n_long_Multi %>% filter(User_ID == snl_holder_pid) #get all sessions associated with that user
    matched_sessions_all <- Sessions_n_long_Multi %>% filter(SessionNumber %in% matched_sessions_user$SessionNumber) #then get all users associated with that session
    matched_Users <- Check_List %>% filter(User_ID %in% matched_sessions_all$User_ID) #then get a list of all those users data from Check List
    if(max(matched_Users$Network_ID)>0){ #if the maximum Network_ID allocation is greater than zero (so someone already has an allocation in that network)
      group_session_IDs <- matched_Users %>% filter(Network_ID!=0) %>% select(Network_ID) #filter out any users who have not been allocated (to remove zero as minimum)
      group_session_ID <- min(group_session_IDs$Network_ID) #get the minimum value for Network_ID out of all users who are allocated a Network_ID
      associated_networks <- Check_List %>% filter(Network_ID %in% group_session_IDs$Network_ID) #get all those from all the associated networks
      Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <- group_session_ID #update the Network_ID for all those in the immediate group
      Check_List$Network_ID[match(associated_networks$User_ID, Check_List$User_ID)] <- group_session_ID #update the Network_ID for all those in the extended group
    }else{ #if the list of users all have no allocation at all
      Next_ID <- Next_ID+1 #iterate the Network_ID counter up
      Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <- Next_ID #assign the network with the new Network_ID
    } #end of if(max(matched_Users$Network_ID)>0) (which has different options depending on whether or not the associated users/sessions with target have an assigned network)
  } #end of if(CL_match[1,2]==0) check (which checks if the current target is in a network)
} #end of loop
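Edit: the next block of code is my current solution. I now have some working logic, in that I can assign each user a number from 1 to 7; I am now working on assigning each session its own Network ID. The next step is handling the case where a user already exists in a network, and finally the case where multiple existing networks are joined.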
CREATE MULTISET TABLE db.AG_Sessions(
WEB_SESSION int,
USER_ID char(8)
);
INSERT INTO db.AG_Sessions values(1,'a');
INSERT INTO db.AG_Sessions values(1,'b');
INSERT INTO db.AG_Sessions values(2,'c');
INSERT INTO db.AG_Sessions values(2,'d');
INSERT INTO db.AG_Sessions values(3,'e');
INSERT INTO db.AG_Sessions values(3,'f');
INSERT INTO db.AG_Sessions values(3,'a');
CREATE MULTISET TABLE db.AG_Networks(
USER_ID char(8),
NetworkID int
);
CREATE Procedure db.AG_Table_Create()
BEGIN
CREATE VOLATILE TABLE AG_Multi_Check AS( --this volatile table will hold all multi sessions and all current attributions connected to the users in those sessions
SEL
a.WEB_SESSION, --get web session from Session Log file
a.USER_ID , --get USER_ID from Session Log file
b.NetworkID --get network ID from Network Log file
FROM db.AG_Sessions as a --source for Session Log File as a
LEFT JOIN --left joined ( to allow for full preservation of Session Log data even if no network assigned
db.AG_Networks as b --source for Network Log File as b
ON --joining on
a.USER_ID = b.USER_ID --joining where the USER_IDs are the same
WHERE --this where clause ensures that the sessions and users selected are from multi-user sessions
WEB_SESSION in (
SEL WEB_SESSION
FROM db.AG_Sessions
GROUP BY WEB_SESSION
HAVING COUNT(WEB_SESSION) > 1
)
)WITH DATA --create table with data
ON COMMIT PRESERVE ROWS; --populate volatile table with data
END;
Create procedure db.new_trail()
Begin --begin procedure
--declare variables
Declare Current_Account char(8); --holder for current account name
Declare Current_Net,Previous_Net,Current_Session,Previous_Session,Max_Net,New_Net int; --holder for current and previous session and network IDs as well as Max current network value
Declare Cursor_Import cursor For --declare the cursor which will iterate through the new input values (FROM THE VOLATILE TABLE)
Select WEB_SESSION,USER_ID,NetworkID from AG_Multi_Check ORDER By WEB_SESSION; --This pointer will select ALL data from the VOLATILE table
Open Cursor_Import; --open the cursor
--initialise variables
SET Previous_Session = 0;
SET Previous_Net = 0;
SET Max_Net = (SELECT COALESCE(max(NetworkID),0) from db.AG_Networks); --set the value of the max current networkID as either the maximum from the list, or if that does not exist, then sets it to zero
SET New_Net = Max_Net+1; -- this increments the value of the maximum network value by 1
Label_loop:
LOOP
Fetch NEXT from Cursor_Import into Current_Session,Current_Account,Current_Net;
IF SQLSTATE = '02000' THEN -- SQLSTATE is a character code; '02000' means no more rows
LEAVE Label_loop;
END IF;
Insert into db.AG_Networks(USER_ID ,NetworkID) values(Current_Account,New_Net);
SET Previous_Session = Current_Session;
SET Previous_Net = Current_Net;
SET New_Net = New_Net + 1;
END LOOP Label_loop;
Close Cursor_Import;
End;
Call db.new_trail();
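Your current approach mimics the R code and will be very slow on Teradata because it uses a cursor (row-by-row processing is always sequential, which is especially bad on a parallel system).
Based on your sample data (thanks for providing sample tables), this SP should update the networks according to your rules. First it assigns new IDs to new users, then it updates users to the lowest ID found in their sessions (a loop is needed because a user may appear in multiple sessions).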
REPLACE PROCEDURE update_network()
BEGIN
MERGE INTO AG_Networks AS tgt
USING
(/* assign a Network ID to new users */
SELECT
s.USER_ID
,Rank() Over (ORDER BY s.USER_ID) -- new sequence
+ Coalesce((SELECT Max(NetworkID) FROM AG_Networks), 0) AS NetworkID -- previous max ID
FROM AG_Sessions AS s -- session log, aliased as s
WHERE NOT EXISTS
( -- only new users
SELECT *
FROM AG_Networks AS n
WHERE s.USER_ID = n.USER_ID
)
GROUP BY 1
) AS src
ON src.USER_ID = tgt.USER_ID
WHEN NOT MATCHED THEN
INSERT (USER_ID, NetworkID)
VALUES (
src.USER_ID
,src.NetworkID
);
REPEAT -- Update to the new NetworkID per user
MERGE INTO AG_Networks AS tgt
USING
( -- Lowest NetworkID per user from multiple session
SELECT USER_ID, Min(new_ID) AS new_ID
FROM
( -- Lowest NetworkID per session
SELECT s.WEB_SESSION, n.USER_ID, NetworkID
,Min(NetworkID) Over (PARTITION BY s.WEB_SESSION) AS new_ID
FROM AG_Networks AS n
LEFT JOIN AG_Sessions AS s
ON s.USER_ID = n.USER_ID
QUALIFY new_ID <> n.NetworkID -- only new IDs, also removes single user sessions
AND Count(s.WEB_SESSION) Over (PARTITION BY NetworkID)> 0 -- remove networks without new sessions
) AS dt
GROUP BY 1
) AS src
ON src.USER_ID = tgt.USER_ID
AND src.new_ID <> tgt.NetworkID -- new NetworkID for user
WHEN MATCHED THEN
UPDATE SET NetworkID = src.new_ID
;
-- no new IDs
UNTIL Activity_Count = 0
END REPEAT;
END;
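The two-phase idea (assign fresh IDs to new users, then repeat a set-wise update until a pass changes nothing) can be illustrated outside Teradata. The following is a minimal Python sketch of that fixed-point loop, not a line-by-line translation of the procedure above: the function name `update_network` and the in-memory data layout are assumptions, and as a further assumption it moves whole existing networks together, including members that have no row in the session log.

```python
def update_network(sessions, networks):
    """sessions: list of (web_session, user_id) pairs.
    networks: dict user_id -> NetworkID, updated in place.
    Returns the number of passes the loop needed.
    """
    # Phase 1: give every unseen user a fresh ID above the current maximum.
    next_id = max(networks.values(), default=0)
    for user in sorted({u for _, u in sessions} - set(networks)):
        next_id += 1
        networks[user] = next_id

    # Phase 2: relabel in bulk until a pass makes no change
    # (the role played by UNTIL Activity_Count = 0 in the procedure).
    passes = 0
    while True:
        passes += 1
        # Lowest NetworkID present in each session.
        session_min = {}
        for sess, user in sessions:
            current = networks[user]
            session_min[sess] = min(session_min.get(sess, current), current)
        # Lowest candidate for each old NetworkID, so networks move as a whole.
        network_min = {}
        for sess, user in sessions:
            old = networks[user]
            network_min[old] = min(network_min.get(old, old), session_min[sess])
        changed = 0
        for user, old in list(networks.items()):
            new = network_min.get(old, old)
            if new != old:
                networks[user] = new
                changed += 1
        if changed == 0:
            return passes


# Sample rows from the question's AG_Sessions inserts.
sample_sessions = [(1, "a"), (1, "b"), (2, "c"), (2, "d"),
                   (3, "e"), (3, "f"), (3, "a")]
sample_networks = {}
update_network(sample_sessions, sample_networks)
print(sorted(sample_networks.items()))
```

With the sample rows, users a, b, e and f share one network (linked through user a) and users c and d share another. Each pass touches all rows at once, which is the property that makes the set-based approach a better fit for a parallel system than a cursor.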