How can I iterate over all records and group rows into networks based on shared events?

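I am implementing a user-network system in which I take records from a log file that simply lists each user and the sessions they were recorded in.

Here is an example of the format: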

SessionNumber | UserID
10000         | A0000001
10001         | B3460009
...           | ...

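Each of these sessions can (and often does) have multiple users.

The aim is to assign each user a network ID. The rules for the network ID are:

If the users in a session have no network ID assigned, they are assigned a new network ID (the smallest unique positive integer not already assigned to a network).

If a user in the session already belongs to a network, all users in that session are assigned that network ID.

If multiple users in a session already belong to networks, the smallest of those network IDs is taken and assigned to all users in that session and to all users in the associated networks.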

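For example, a session occurs with users A, B and C. The table below shows their network IDs before the session:

User | Network ID
A    | -
B    | 6
C    | 8

In addition, the other users in network 8 are:

User | Network ID
D    | 8
E    | 8

The expected result of the code is:

User | Network ID
A    | 6
B    | 6
C    | 6
D    | 6
E    | 6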
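To make the three rules concrete, here is a minimal Python sketch (not the R or SQL code from this post) that applies them to a single session. The function name `apply_session` and the dictionary `network_of` are illustrative assumptions and do not appear in the original code.

```python
def apply_session(network_of, session_users):
    """Apply the three network rules to one session.

    network_of maps user -> network ID for users that already have one
    (hypothetical structure, chosen only for this illustration).
    """
    # Network IDs already held by users in this session
    existing = {network_of[u] for u in session_users if u in network_of}

    if existing:
        # Rules 2 and 3: reuse the smallest existing network ID
        target = min(existing)
    else:
        # Rule 1: smallest positive integer not used by any network
        used = set(network_of.values())
        target = 1
        while target in used:
            target += 1

    # Rule 3: move everyone in the associated networks to the target ID
    for user, net in list(network_of.items()):
        if net in existing:
            network_of[user] = target

    # Finally, assign the target ID to every user in the session
    for user in session_users:
        network_of[user] = target
    return network_of


# State before the session: A has no network, B is in 6, C/D/E are in 8
networks = {"B": 6, "C": 8, "D": 8, "E": 8}
apply_session(networks, ["A", "B", "C"])
print(networks)  # every user A to E now maps to network 6
```

Applying this function to each session in turn is the sequential, row-by-row style of processing; it is shown here only to pin down what the rules mean.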

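I have developed code in RStudio that does this by going through all sessions in the new log file sequentially, checking the networks and assigning the appropriate values.

The problem is that, ideally, this logic needs to be deployed in a SQL environment: Teradata, via Teradata SQL-Assistant only.

I don't have much experience writing logic or for loops in SQL; I have only really done bulk queries.

Is this feasible in SQL alone? If so, can you suggest resources or a direction for achieving it?

Thanks!

As mentioned, I have already coded this in RStudio, but it requires temporary holder variables and is not coded efficiently, and I don't know whether it can be translated directly into SQL.

From my research, cursors may be the way to go, but almost all the resources I found are for SQL in MySQL or SQL Server rather than the Teradata environment.

Edit: below is my code from RStudio, which accomplishes the network assignment.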

library(dplyr) #provides %>%, select(), distinct() and filter()

Check_List <- Sessions_n_long_Multi %>% select(User_ID) %>% distinct() #get a list of all Users listed distinctly
Check_List$Network_ID <- 0 #create new column which stores Network_ID (which User-Network a User belongs to)

Next_ID <- 0 #Initialise Network_ID counter
for (row_count in 1:nrow(Sessions_n_long_Multi)){ #for all sessions in the long format session-user attribution table
  snl_holder_session <- Sessions_n_long_Multi[row_count,1] #get the session number for the current target
  snl_holder_pid <- Sessions_n_long_Multi[row_count,2] #get the User_ID for the current target
  CL_match <- Check_List %>% filter(User_ID == snl_holder_pid) #get the CheckList data which matches the target User_ID


  if(CL_match[1,2]==0){ #if the target user has no Network_ID then..
    matched_sessions_user <- Sessions_n_long_Multi %>% filter(User_ID == snl_holder_pid) #get all sessions associated with that user
    matched_sessions_all <- Sessions_n_long_Multi %>% filter(SessionNumber %in% matched_sessions_user$SessionNumber) #then get all users  associated  with that session
    matched_Users <- Check_List %>% filter(User_ID %in% matched_sessions_all$User_ID) #then get a list of all those users data from Check List

    if(max(matched_Users$Network_ID)>0){ #if the maximum Network_ID allocation is greater than zero (so someone already has an allocation in that network)
      group_session_IDs <- matched_Users %>% filter(Network_ID!=0) %>% select(Network_ID) #filter out any users who have not been allocated (to remove zero as minimum)
      group_session_ID <- min(group_session_IDs$Network_ID) #get the minimum value for Network_ID out of all users who are allocated a Network_ID

      associated_networks <- Check_List %>% filter(Network_ID %in% group_session_IDs$Network_ID) #get all those from all the associated networks

      Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <- group_session_ID#update the Network_ID for all those in the immediate group
      Check_List$Network_ID[match(associated_networks$User_ID, Check_List$User_ID)] <- group_session_ID#update the Network_ID for all those in the extended group

    }else{#if the list of users all have no allocation at all
      Next_ID <- Next_ID+1 #iterate the Network_ID counter up
      Check_List$Network_ID[match(matched_Users$User_ID, Check_List$User_ID)] <-Next_ID #assign the network with the new Network_ID
    }#end of if(max(matched_Users$Network_ID)>0) (which has different options depending on whether or not the associated users/sessions with target have an assigned network)

  }#end of if(CL_match[1,2]==0) check (which checks if the current target is in a network)
}#end of loop

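Edit: the next piece of code is my current solution. I now have some working logic, in that I can assign each user a number from 1 to 7, and I am now working on assigning each session its own network ID. The next focus is the case where a user already exists in a network, and finally the case where multiple existing networks are joined.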

CREATE MULTISET TABLE db.AG_Sessions(
WEB_SESSION int,
USER_ID char(8)
);

INSERT INTO db.AG_Sessions values(1,'a');
INSERT INTO db.AG_Sessions values(1,'b');
INSERT INTO db.AG_Sessions values(2,'c');
INSERT INTO db.AG_Sessions values(2,'d');
INSERT INTO db.AG_Sessions values(3,'e');
INSERT INTO db.AG_Sessions values(3,'f');
INSERT INTO db.AG_Sessions values(3,'a');

CREATE MULTISET TABLE db.AG_Networks(
USER_ID char(8),
NetworkID int
);

CREATE Procedure db.AG_Table_Create()
BEGIN
CREATE VOLATILE  TABLE AG_Multi_Check AS( --this volatile table will hold all multi sessions and all current attributions connected to the users in those sessions
SEL 
a.WEB_SESSION, --get web session from Session Log file
a.USER_ID , --get USER_ID from Session Log file
b.NetworkID --get network ID from Network Log file
 FROM db.AG_Sessions as a --source for Session Log File as a
 LEFT JOIN --left joined ( to allow for full preservation of Session Log data even if no network assigned
 db.AG_Networks as b --source for Network Log File as b
 ON --joining on
 a.USER_ID = b.USER_ID --joining where the USER_IDs are the same
WHERE --this where clause ensures that the sessions and users selected are from multi-user sessions
WEB_SESSION in (
SEL WEB_SESSION
FROM db.AG_Sessions
GROUP BY WEB_SESSION
HAVING COUNT(WEB_SESSION) > 1
)
)WITH DATA --create table with data
    ON COMMIT PRESERVE ROWS; --populate volatile table with data
END;

Create procedure db.new_trail()

Begin --begin procedure
--declare variables
Declare Current_Account char(8); --holder for current account name
Declare Current_Net,Previous_Net,Current_Session,Previous_Session,Max_Net,New_Net int; --holder for current and previous session and network IDs as well as Max current network value

Declare Cursor_Import  cursor  For --declare the cursor which will iterate through the new input values (FROM THE VOLATILE TABLE)
Select WEB_SESSION,USER_ID,NetworkID  from AG_Multi_Check ORDER By WEB_SESSION; --This pointer will select ALL data from the VOLATILE table

Open Cursor_Import; --open the cursor

--initialise variables
SET Previous_Session = 0;
SET Previous_Net = 0;
SET Max_Net = (SELECT COALESCE(max(NetworkID),0) from db.AG_Networks); --set the value of the max current networkID as either the maximum from the list, or if that does not exist, then sets it to zero
SET New_Net = Max_Net+1; -- this increments the value of the maximum network value by 1

Label_loop:
LOOP
Fetch NEXT from Cursor_Import into Current_Session,Current_Account,Current_Net; 
IF SQLSTATE = '02000' THEN --'02000' = no more rows; SQLSTATE is a character value, so compare against a string
  LEAVE Label_loop;
END IF;

Insert into  db.AG_Networks(USER_ID ,NetworkID) values(Current_Account,New_Net);
SET Previous_Session = Current_Session;
SET Previous_Net = Current_Net;
SET New_Net = New_Net + 1;

END LOOP Label_loop;

Close Cursor_Import;
End;

Call db.new_trail();

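Your current approach mimics the R code and will be very slow on Teradata because it uses a cursor (row-by-row processing is always sequential, which is especially bad on a parallel system).

Based on your sample data (thanks for providing example tables), this SP should update the networks according to your rules. First it assigns new IDs to new users, then it updates users to the lowest ID in their network (a loop is needed because each user may have multiple sessions).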

REPLACE PROCEDURE update_network()
BEGIN

MERGE INTO AG_Networks AS tgt
USING
 (/* assign a Network ID to new users */
   SELECT 
      s.USER_ID
     ,Rank() Over (ORDER BY s.USER_ID) -- new sequence
      + Coalesce((SELECT Max(NetworkID) FROM AG_Networks), 0) AS NetworkID  -- previous max ID
   FROM AG_Sessions AS s --source for Session Log File as s
   WHERE NOT EXISTS
    ( -- only new users
      SELECT *
      FROM AG_Networks AS n
      WHERE s.USER_ID = n.USER_ID
    )
   GROUP BY 1
 ) AS src
ON src.USER_ID = tgt.USER_ID
WHEN NOT MATCHED THEN
INSERT (USER_ID, NetworkID)
VALUES (
   src.USER_ID
  ,src.NetworkID
 );

REPEAT -- Update to the new NetworkID per user

   MERGE INTO AG_Networks AS tgt
   USING
    ( -- Lowest NetworkID per user from multiple session
      SELECT USER_ID, Min(new_ID) AS new_ID
      FROM
       ( -- Lowest NetworkID per session
         SELECT s.WEB_SESSION, n.USER_ID, NetworkID
              ,Min(NetworkID) Over (PARTITION BY s.WEB_SESSION) AS new_ID
            FROM AG_Networks AS n
            LEFT JOIN AG_Sessions AS s
              ON s.USER_ID = n.USER_ID
            QUALIFY new_ID <> n.NetworkID -- only new IDs, also removes single user sessions
                AND Count(s.WEB_SESSION) Over (PARTITION BY NetworkID)> 0 -- remove networks without new sessions
       ) AS dt 
      GROUP BY 1
    ) AS src
   ON src.USER_ID = tgt.USER_ID
   AND src.new_ID <> tgt.NetworkID -- new NetworkID for user
   WHEN MATCHED THEN
   UPDATE SET NetworkID = src.new_ID
   ;
   -- no new IDs
   UNTIL Activity_Count = 0 
   END REPEAT;
END;
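
For readers who want to trace the set-based idea outside Teradata, the following is a rough Python sketch of the same two phases: give every new user a fresh ID above the current maximum, then repeatedly pull IDs down to the lowest one seen in a shared session until nothing changes. It is not a translation of the MERGE statements above. The function name `update_networks` and the in-memory structures are assumptions made for illustration, and the sketch also relabels whole existing networks in each pass so that the "associated networks" rule from the question holds.

```python
def update_networks(sessions, network_of):
    """sessions: list of (web_session, user_id) rows.
    network_of: dict user_id -> network ID for already known users.
    Returns a new dict with updated network IDs (illustrative only)."""
    network_of = dict(network_of)

    # Phase 1: new users get consecutive IDs above the previous maximum
    next_id = max(network_of.values(), default=0)
    for user in sorted({u for _, u in sessions}):
        if user not in network_of:
            next_id += 1
            network_of[user] = next_id

    # Group users by session once
    by_session = {}
    for session, user in sessions:
        by_session.setdefault(session, []).append(user)

    # Phase 2: repeat until a pass changes nothing
    changed = True
    while changed:
        # Lowest ID that each current network should move to
        relabel = {}
        for users in by_session.values():
            lowest = min(network_of[u] for u in users)
            for u in users:
                old = network_of[u]
                relabel[old] = min(relabel.get(old, old), lowest)

        changed = False
        for user, old in list(network_of.items()):
            new = relabel.get(old, old)
            if new != old:
                network_of[user] = new
                changed = True
    return network_of


# Sample rows from the question's INSERT statements
rows = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e"), (3, "f"), (3, "a")]
result = update_networks(rows, {})
print(result)  # a, b, e, f share network 1; c and d share network 3
```

As in the procedure above, the number of passes depends on how long the chains of linked sessions are, not on the number of rows, which is why a set-based loop tends to suit a parallel system better than a cursor.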