Compare and rank rows in a dataframe based on two columns?
I am trying to figure out how to compare and rank multiple rows in a pandas dataframe based on two conditions.
These are the conditions:
rule1 < rule2
if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2)
or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)
rule1 = rule2
if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)
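Read literally, these conditions compare two rules column by column, so some pairs end up not ordered at all (one rule has the higher support, the other the higher confidence). A minimal sketch of that comparison, assuming each rule is passed as a (support, confidence) tuple; the name compare_rules and its return convention are hypothetical:

```python
from typing import Optional, Tuple

Rule = Tuple[float, float]  # (support, confidence); hypothetical alias


def compare_rules(rule1: Rule, rule2: Rule) -> Optional[int]:
    """Return -1 if rule1 < rule2, 0 if rule1 = rule2, 1 if rule1 > rule2,
    and None if neither rule is at least as good in both columns."""
    s1, c1 = rule1
    s2, c2 = rule2
    if s1 == s2 and c1 == c2:
        return 0
    # Both columns <=, and not equal in both, matches the "rule1 < rule2" condition.
    if s1 <= s2 and c1 <= c2:
        return -1
    if s1 >= s2 and c1 >= c2:
        return 1
    return None  # incomparable under the stated conditions


print(compare_rules((0.00141, 0.533333), (0.0048, 0.873015)))   # -1
print(compare_rules((0.00106, 0.012060), (0.00106, 0.012060)))  # 0
print(compare_rules((0.0048, 0.873015), (0.0085, 0.593220)))    # None
```

The last call shows why a single rank column needs some extra tie-breaking logic: (4444, 5555) and (7414, 1214) cannot be ordered by these conditions alone.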
My dataframe is set up like this:
import pandas as pd
data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}
df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])
(Index)
Rules Support Confidence
(4444, 5555) 0.0048 0.873015
(8747, 1254) 0.00141 0.533333
(7414, 1214) 0.0085 0.593220
(5655, 6651) 0.00106 0.012060
(4454, 3321) 0.00106 0.012060
(4893, 4923) 0.00038 0.237699
(1271, 8330) 0.00179 0.453423
(9112, 4722) 0.00913 0.097672
(4511, 6722) 0.00221 0.116983
(1102, 5789) 0.00173 0.541221
(2340, 5720) 0.00098 0.743222
(9822, 5067) 0.00024 0.378219
This is what I would like the dataframe to look like (I am not sure what the actual ranks would be; this ranking is hypothetical):
(Index)
Rules Support Confidence Rank
(7414, 1214) 0.0085 0.593220 1
(4444, 5555) 0.0048 0.873015 2
(5655, 6651) 0.00106 0.012060 3
(4454, 3321) 0.00106 0.012060 3
(8747, 1254) 0.00141 0.533333 4
(1271, 8330) 0.00179 0.453423 5
(1102, 5789) 0.00173 0.541221 6
(2340, 5720) 0.00098 0.743222 7
(9822, 5067) 0.00024 0.378219 8
(9112, 4722) 0.00913 0.097672 9
(4511, 6722) 0.00221 0.116983 10
(4893, 4923) 0.00038 0.237699 11
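I have some ideas about how to get this code working, but I am not sure how to compare every rule against every other rule. I want the best rules, according to the conditions, to float to the top. It is not a large dataframe (< 1000 rows), so I do not really care about speed, only about accuracy.
This is the code I have so far: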
def rank_rules(confidence, support):
    # IF / ELSE goes here
    df['rank'] = some_var.rank(method='max')
    df.sort_values(by=['rank'], ascending=False)
    return df

df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
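Solution: suggested approach
If I understand correctly, you are trying to create a ranking system based on multiple columns (support, confidence). You can treat these two columns as two orthogonal axes (x, y) on a scatter plot. In the absence of any further sorting logic, I assume the Euclidean distance from the origin is something we can use here to order the rows and create a rank.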
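As a small illustration of the distance idea, here is a sketch on four of the rules from the question, using the raw (unscaled) columns. The names df_demo, distance and rank are only for this example, and using a dense rank so that identical rules share a rank is an assumption based on the "rule1 = rule2" condition in the question:

```python
import pandas as pd

# Four rules taken from the question's data (hypothetical subset for brevity).
df_demo = pd.DataFrame({
    'rules': [(4444, 5555), (7414, 1214), (5655, 6651), (4454, 3321)],
    'support': [0.0048, 0.0085, 0.00106, 0.00106],
    'confidence': [0.873015, 0.593220, 0.012060, 0.012060],
})

# Euclidean distance from the origin in the (support, confidence) plane.
df_demo['distance'] = (df_demo['support']**2 + df_demo['confidence']**2) ** 0.5

# Largest distance gets rank 1; rows with identical distance share a rank.
df_demo['rank'] = df_demo['distance'].rank(method='dense', ascending=False).astype(int)

print(df_demo.sort_values('rank'))
```

For non-negative values, a rule that is no larger in both columns and smaller in at least one also has a smaller distance, so this ordering should not contradict the conditions in the question. On the raw values, however, confidence dominates the distance because support is roughly a hundred times smaller, which is the motivation for rescaling the columns first.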
Processing the data
Here I show that using MinMaxScaler could be an option (in addition to, optionally, using z-scores).
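To make the two rescaling options concrete, here is a sketch in plain pandas with hypothetical helper names min_max and zscore; it assumes the column is not constant, so the range and standard deviation are non-zero. It also illustrates that min-max scaling applied to z-scores gives the same result as min-max scaling applied to the raw column, because the z-score is just a shift and a positive rescale:

```python
import pandas as pd

support = pd.Series([0.0048, 0.00141, 0.0085, 0.00106])


def min_max(s: pd.Series) -> pd.Series:
    """Map the values linearly onto [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())


def zscore(s: pd.Series) -> pd.Series:
    """Centre on the mean and divide by the sample standard deviation."""
    return (s - s.mean()) / s.std()


direct = min_max(support)         # min-max on the raw values
via_z = min_max(zscore(support))  # z-score first, then min-max

print(direct.round(6).tolist())
print(via_z.round(6).tolist())
```

Both printed lists should agree up to floating-point error, so the z-score step mainly matters when the z-scores themselves are of interest.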
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Notebook-only settings; uncomment when running in Jupyter.
# %matplotlib inline
# %config InlineBackend.figure_format = 'svg'  # 'svg', 'retina'
# plt.style.use('seaborn-white')  # named 'seaborn-v0_8-white' in newer matplotlib releases

# Move the index into regular columns so the rows get a default integer index
df = df.reset_index(drop=False).rename(columns={'index': 'rules'})

# Euclidean distance on the raw values
df['distance'] = (df.support**2 + df.confidence**2)**0.5

# z-scores of each column, and the distance in z-score space
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5

# Decimal places for every column whose name contains the key
round_strategy = {
    'support': 5,
    'confidence': 6,
    'distance': 5,
}

# Rescale both columns to [0, 1]
scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]),
                   columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5

# Largest scaled distance first; ranks start at 1 and identical rules share a rank
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df['scaled_distance'].rank(method='dense', ascending=False).astype(int)

decimals = dict()
for col in df.columns:
    for key, value in round_strategy.items():
        if key in col:
            decimals.update({col: value})
df = df.round(decimals=decimals)

# Relative marker sizes and colors derived from the rank
sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df
Plot
import plotly.express as px

# df is the ranked dataframe built in the code above
fig = px.scatter(df, x="scaled_support", y="scaled_confidence", text="Rank",
                 log_x=False, size_max=20,
                 color="Rank",
                 size=(np.arange(df.index.size) + 4)[::-1],
                 hover_data=df.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()
Dummy data
import pandas as pd
data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}
df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])
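As a possible alternative to the distance-based ranking, the pairwise conditions from the question can be applied directly by comparing every rule with every other rule and counting how many rules each one beats. This is only a sketch under stated assumptions: ranking by the count of dominated rules is one of several ways to turn the pairwise conditions into a single rank, and the names rules_df, dominates and n_dominated are hypothetical. The rules are kept as a regular column here rather than as the index.

```python
import numpy as np
import pandas as pd

data = {
    'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923),
              (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
    'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038,
                0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
    'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699,
                   0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219],
}
rules_df = pd.DataFrame(data)

s = rules_df['support'].to_numpy()
c = rules_df['confidence'].to_numpy()

# le[i, j]: rule j is <= rule i in both columns; eq[i, j]: equal in both columns.
le = (s[None, :] <= s[:, None]) & (c[None, :] <= c[:, None])
eq = (s[None, :] == s[:, None]) & (c[None, :] == c[:, None])

# dominates[i, j] is True when rule j < rule i under the question's conditions.
dominates = le & ~eq

# Assumption: a rule that beats more rules ranks higher; equal counts share a rank.
rules_df['n_dominated'] = dominates.sum(axis=1)
rules_df['rank'] = rules_df['n_dominated'].rank(method='dense', ascending=False).astype(int)
rules_df = rules_df.sort_values('rank').reset_index(drop=True)
print(rules_df)
```

The all-pairs comparison builds an n-by-n boolean matrix, which is inexpensive for fewer than 1000 rules. Identical rules always receive the same count and therefore the same rank.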