Compare and rank rows in dataframe based on two columns?

I'm trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.

These are the conditions:

rule1 < rule2 

if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2) 

or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)

    
rule1 = rule2 

if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)

My dataframe is set up like this:

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])

   
  (Index)
   Rules       Support     Confidence
(4444, 5555)   0.0048      0.873015
(8747, 1254)   0.00141     0.533333
(7414, 1214)   0.0085      0.593220
(5655, 6651)   0.00106     0.012060
(4454, 3321)   0.00106     0.012060
(4893, 4923)   0.00038     0.237699
(1271, 8330)   0.00179     0.453423
(9112, 4722)   0.00913     0.097672
(4511, 6722)   0.00221     0.116983
(1102, 5789)   0.00173     0.541221
(2340, 5720)   0.00098     0.743222
(9822, 5067)   0.00024     0.378219

This is what I want the dataframe to look like (I'm not sure what the ranks would actually be; this is a hypothetical ranking):

   (Index)
    Rules      Support     Confidence    Rank
(7414, 1214)   0.0085      0.593220        1
(4444, 5555)   0.0048      0.873015        2
(5655, 6651)   0.00106     0.012060        3
(4454, 3321)   0.00106     0.012060        3
(8747, 1254)   0.00141     0.533333        4
(1271, 8330)   0.00179     0.453423        5
(1102, 5789)   0.00173     0.541221        6
(2340, 5720)   0.00098     0.743222        7
(9822, 5067)   0.00024     0.378219        8
(9112, 4722)   0.00913     0.097672        9
(4511, 6722)   0.00221     0.116983        10
(4893, 4923)   0.00038     0.237699        11

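I have some ideas on how to get this code working, but I'm not sure how to compare every rule against every other rule. I want the best rules to float to the top based on the conditions. It's not a large dataframe (< 1000 rows), so I don't really care about speed, only accuracy.

This is the code I have so far: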

def rank_rules(confidence, support):

    # IF / ELSE goes here
   
    df['rank'] = some_var.rank(method='max')
  
    df.sort_values(by=['rank'], ascending=False)

    return df


df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
 

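Solution: Suggested approach

If I understand correctly, you are trying to create a ranking system based on multiple columns (support, confidence). You can think of these two as two orthogonal axes (x, y) on a scatter plot. In the absence of any further sorting logic, I assume the Euclidean distance from the origin is something we can use here to order the rows and create a ranking.

Processing the data

Here I show that using MinMaxScaler could be an option (in addition to optionally using z-scores).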
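To make the reason for scaling concrete, here is a small sketch. The three rows in `demo` are made-up values and `min_max` is a helper written only for this illustration; it does by hand the same linear rescaling to [0, 1] that MinMaxScaler performs per column. It shows that the unscaled distance is driven almost entirely by confidence, because support is roughly a thousand times smaller, and that a z-score step before min-max scaling does not change the scaled values.

```python
import numpy as np
import pandas as pd

# Three hypothetical rules; the numbers are chosen only for illustration.
demo = pd.DataFrame({
    "support": [0.0085, 0.0048, 0.0002],
    "confidence": [0.59, 0.65, 0.38],
})

def min_max(s):
    """Rescale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())

# Unscaled distance from the origin: confidence swamps support.
demo["distance"] = np.hypot(demo["support"], demo["confidence"])

# Scaled distance: both columns span [0, 1] and contribute comparably.
demo["scaled_distance"] = np.hypot(min_max(demo["support"]),
                                   min_max(demo["confidence"]))

# Z-scoring is itself a linear rescaling with a positive slope, so applying
# min-max scaling afterwards gives the same result as min-max scaling alone.
z = (demo["support"] - demo["support"].mean()) / demo["support"].std()
same = np.allclose(min_max(z), min_max(demo["support"]))

print(demo)
print("z-score then min-max equals min-max alone:", same)
```

With these values the unscaled distance puts the second row first, while the scaled distance puts the first row (the one with the highest support) first, so the choice of scaling really does affect the ranking.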

Code

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.preprocessing import MinMaxScaler

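# The next two lines are Jupyter/IPython magics; they only work inside a notebook.
%matplotlib inline 
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina' 
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-white'.
plt.style.use('seaborn-white')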

df = df.reset_index(drop=False).rename(columns={'index': 'rules'})
df['distance'] = (df.support**2 + df.confidence**2)**0.5
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5

round_strategy = {
    'support': 5,
    'confidence': 6,
    'distance': 5,
}

scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]), 
                   columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5
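# Sort so the largest scaled distance comes first, then use the row position as the rank.
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df.index  # 0-based: rank 0 is the row with the largest scaled distance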

decimals = dict()
for col in df.columns:
    for key, value in round_strategy.items():
        if key in col:
            decimals.update({col: value})
df = df.round(decimals=decimals)

sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df

Plot

import plotly.express as px

# df is the ranked dataframe built in the code above.
fig = px.scatter(df, x="scaled_support", y="scaled_confidence", text="Rank", 
                  log_x=False, size_max=20, 
                  color="Rank", 
                  size=(np.arange(df.index.size) + 4)[::-1], 
                  hover_data=df.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()

Dummy data

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])
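
As an alternative to a distance-based score, the pairwise conditions from the question can also be applied directly. What follows is only a sketch under assumptions: the conditions define a partial order (some pairs of rules are simply not comparable), so they do not give a unique ranking by themselves. Here I assume that a rule ranks higher the fewer other rules beat it. The column names `dominated_by` and `rank` are names I made up for this example, and `rules` is kept as an ordinary column rather than as the index.

```python
import pandas as pd

df = pd.DataFrame({
    "rules": [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651),
              (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722),
              (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
    "support": [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038,
                0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
    "confidence": [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699,
                   0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219],
})

s = df["support"].to_numpy()
c = df["confidence"].to_numpy()

# less[i, j] is True when rule i < rule j under the question's definition:
# (support_i <= support_j and confidence_i < confidence_j) or
# (support_i <  support_j and confidence_i <= confidence_j)
less = ((s[:, None] <= s[None, :]) & (c[:, None] < c[None, :])) | \
       ((s[:, None] < s[None, :]) & (c[:, None] <= c[None, :]))

# Assumption: score each rule by how many other rules beat it.
df["dominated_by"] = less.sum(axis=1)

# Dense ranking: rules beaten by nobody get rank 1, and rules with identical
# support and confidence always share a rank.
df["rank"] = df["dominated_by"].rank(method="dense").astype(int)

ranked = df.sort_values("rank").reset_index(drop=True)
print(ranked)
```

This compares every rule with every other rule in one step using NumPy broadcasting, which is fine for fewer than 1000 rows. Rules with rank 1 are the ones that no other rule beats on both columns; ties in `rank` are expected, because the conditions leave many pairs unordered.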