Recursive CTE / transitive closure in pandas

My scenario:

I can do this with a recursive CTE. However, my supervisor has asked me to find an alternative approach :(

Recursive CTE code:

with recursive cte as (
      select ID, Email, MobileNo, DeviceId, IPAddress, id as tracking
      from tableuser
      where isfraudsterstatus = 1
      union all
      select u.id, u.email, u.mobileno, u.deviceid, u.ipaddress , concat_ws(',', cte.tracking, u.id)
      from cte join
           tableuser u
           on u.email = cte.email or
              u.mobileno = cte.mobileno or
              u.deviceid = cte.deviceid or 
              u.ipaddress = cte.ipaddress
      where find_in_set(u.id, cte.tracking) = 0
     )
select *
from cte;

Output:

Can I do this in Python instead? I am thinking of using pandas.

import pandas as pd
df = pd.DataFrame({'userId':
                       [1, 2, 3, 4,],
                   'phone':
                       ['01111', '01111', '53266', '7455'],
                   'email':
                       ['aziz@gmail', 'aziz1@gmail', 'aziz1@gmail', 'aziz2@gmail'],
                   'deviceId':
                       ['Ab123', 'Ab1234', 'Ab12345', 'Ab12345'],
                   'isFraud':
                   [1,0,0,0]})

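Here is one solution. It essentially computes the transitive closure of the fraudster users: starting from the users already flagged as fraud, it repeatedly flags every user who shares a device ID, email, or phone number with a flagged user, until no new users are found: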

import pandas as pd

df = pd.DataFrame({'userId':
                       [1, 2, 3, 4,],
                   'phone':
                       ['01111', '01111', '53266', '7455'],
                   'email':
                       ['aziz@gmail', 'aziz1@gmail', 'aziz1@gmail', 'aziz2@gmail'],
                   'deviceId':
                       ['Ab123', 'Ab1234', 'Ab12345', 'Ab12345'],
                   'isFraud':
                   [1,0,0,0]})


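def expand_fraud(no_fraud, fraud, col_name):
    # Inner join: keeps non-fraud rows that share a col_name value with a fraud row.
    # Overlapping columns from no_fraud get the "_x" suffix, so userId_x is the
    # ID of the newly linked user.
    t = pd.merge(no_fraud, fraud, on=col_name)
    if len(t):
        print(f"Found Match on {col_name}")
        # Note: this updates the global df in place.
        df.loc[df.userId.isin(t.userId_x), "isFraud"] = 1
        return True
    return False

# Repeat until a full pass over all columns flags nobody new (a fixed point).
while True:
    added_fraud = False
    fraud = df[df.isFraud == 1]
    no_fraud = df[df.isFraud == 0]
    added_fraud |= expand_fraud(no_fraud, fraud, "deviceId")
    added_fraud |= expand_fraud(no_fraud, fraud, "email")
    added_fraud |= expand_fraud(no_fraud, fraud, "phone")
    if not added_fraud:
        break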

print(df)

The output is:

   userId  phone        email deviceId  isFraud
0       1  01111   aziz@gmail    Ab123        1
1       2  01111  aziz1@gmail   Ab1234        1
2       3  53266  aziz1@gmail  Ab12345        1
3       4   7455  aziz2@gmail  Ab12345        1
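
The same fixed-point idea can also be written as a function that does not modify a global DataFrame. The following is only a minimal sketch, not a definitive implementation: the function name `flag_linked_users`, its parameters, and the extra fifth row are hypothetical and were added for illustration. It uses `Series.isin` instead of `merge`, and stops when a pass adds no new rows.

```python
import pandas as pd


def flag_linked_users(users, key_cols, flag_col="isFraud"):
    """Return a copy of `users` where `flag_col` is 1 for every row that is
    reachable from an already flagged row through a shared key value.
    (Hypothetical helper; name and signature are assumptions.)"""
    result = users.copy()
    flagged = result[flag_col] == 1
    while True:
        linked = flagged.copy()
        for col in key_cols:
            # Values of this column that appear on any currently flagged row.
            seen = result.loc[flagged, col].dropna()
            linked |= result[col].isin(seen)
        if linked.equals(flagged):
            break  # fixed point: the last pass reached no new rows
        flagged = linked
    result[flag_col] = flagged.astype(int)
    return result


users = pd.DataFrame({
    "userId": [1, 2, 3, 4, 5],
    "phone": ["01111", "01111", "53266", "7455", "9999"],
    "email": ["aziz@gmail", "aziz1@gmail", "aziz1@gmail", "aziz2@gmail", "other@gmail"],
    "deviceId": ["Ab123", "Ab1234", "Ab12345", "Ab12345", "Zz000"],
    # User 5 is a hypothetical extra row that shares nothing with the others.
    "isFraud": [1, 0, 0, 0, 0],
})

flagged_users = flag_linked_users(users, ["phone", "email", "deviceId"])
print(flagged_users[["userId", "isFraud"]])
```

With this sample data, users 1 to 4 end up flagged and user 5 stays unflagged, since it shares no phone, email, or device ID with anyone in the chain. Passing the key columns as a list also makes it straightforward to add another column, such as an IP address, if the data has one.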