How to determine the top 3 column values for each row
I have a dataframe in the following format:

| ID | Payer | Payee | Mode1 | Probability1 | Mode2 | Probability2 | Mode3 | Probability3 | Mode4 | Probability4 | Month |
|----|-------|-------|-------|--------------|-------|--------------|--------|--------------|--------|--------------|--------|
| 1 | xyz | wqu | cash | 0.16 | wire | 0.89 | upi | 0.81 | cheque | 0.69 | 201801 |
| 2 | wqu | xyz | wire | 0.28 | cash | 0.19 | upi | 0.77 | cheque | 0.58 | 201801 |
| 3 | pqr | xyz | upi | 0.35 | cash | 0.11 | cheque | 0.48 | wire | 0.66 | 201803 |
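
Each Probability column holds the value for the Mode column with the same number (Probability1 belongs to Mode1, and so on).
Now, for each row, I want to pick the top 3 probability values across these columns, together with their modes, like this:
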
| ID | Payer | Payee | Mode1 | Probability1 | Mode2 | Probability2 | Mode3 | Probability3 | Mode4 | Probability4 | Month | Top1Mode | Top1Value | Top2Mode | Top2Value | Top3Mode | Top3Value |
|----|-------|-------|-------|--------------|-------|--------------|--------|--------------|--------|--------------|--------|----------|-----------|----------|-----------|----------|-----------|
| 1 | xyz | wqu | cash | 0.16 | wire | 0.89 | upi | 0.81 | cheque | 0.69 | 201801 | wire | 0.89 | upi | 0.81 | cheque | 0.69 |
| 2 | wqu | xyz | wire | 0.28 | cash | 0.19 | upi | 0.77 | cheque | 0.58 | 201801 | upi | 0.77 | cheque | 0.58 | wire | 0.28 |
| 3 | pqr | xyz | upi | 0.35 | cash | 0.11 | cheque | 0.48 | wire | 0.66 | 201803 | wire | 0.66 | cheque | 0.48 | upi | 0.35 |
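
In case the tables are hard to read, here is a further explanation. For row 1 (ID 1), wire has the highest probability (0.89), so it goes in the Top1Mode column and its value goes in the next column, Top1Value. Similarly, upi has the second-highest probability (0.81), so it goes in Top2Mode and its value in Top2Value.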
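
To make the rule concrete, here is a small plain-Python sketch for the ID 1 row only. The names `row` and `top3` are just for illustration; the pairs are copied from the table above.

```python
# Row for ID 1, written as (mode, probability) pairs taken from the table.
row = [("cash", 0.16), ("wire", 0.89), ("upi", 0.81), ("cheque", 0.69)]

# Sort by probability, highest first, and keep the first three pairs.
top3 = sorted(row, key=lambda pair: pair[1], reverse=True)[:3]

print(top3)
# [('wire', 0.89), ('upi', 0.81), ('cheque', 0.69)]
```

The first pair fills Top1Mode and Top1Value, the second fills Top2Mode and Top2Value, and the third fills Top3Mode and Top3Value.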
A solution in either Pandas or PySpark works for me.
One thing I can think of is a UDF (but I would like to see whether anyone has a better solution):

    from pyspark.sql.functions import lit, udf

    # udf defaults to a string return type, so the Top*Value columns come back as strings.
    @udf
    def getProbability(Mode1, Probability1, Mode2, Probability2, Mode3, Probability3, Mode4, Probability4, num, mode):
        # Pair each mode with its probability and sort by probability, highest first.
        prob_list = [(Mode1, Probability1), (Mode2, Probability2), (Mode3, Probability3), (Mode4, Probability4)]
        prob_list = sorted(prob_list, key=lambda x: x[1], reverse=True)
        if mode == "Mode":
            return prob_list[num][0]
        return str(prob_list[num][1])

    # Input columns, passed to the UDF by name in the same order as its parameters.
    cols = ["Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4"]

    df = (
        df.withColumn("Top1Mode", getProbability(*cols, lit(0), lit("Mode")))
        .withColumn("Top1Value", getProbability(*cols, lit(0), lit("Prob")))
        .withColumn("Top2Mode", getProbability(*cols, lit(1), lit("Mode")))
        .withColumn("Top2Value", getProbability(*cols, lit(1), lit("Prob")))
        .withColumn("Top3Mode", getProbability(*cols, lit(2), lit("Mode")))
        .withColumn("Top3Value", getProbability(*cols, lit(2), lit("Prob")))
    )
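
For comparison, here is a rough idea of how the same ranking might look in Pandas without a per-row function. This is only a sketch under assumptions: the sample frame is rebuilt by hand from the table above (Payer, Payee and Month are left out for brevity), the helper names `mode_cols`, `prob_cols`, `order`, `top_modes` and `top_probs` are made up for illustration, and it relies on NumPy's `argsort` and `take_along_axis` to pick the three largest probabilities per row.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question (subset of columns).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Mode1": ["cash", "wire", "upi"],
    "Probability1": [0.16, 0.28, 0.35],
    "Mode2": ["wire", "cash", "cash"],
    "Probability2": [0.89, 0.19, 0.11],
    "Mode3": ["upi", "upi", "cheque"],
    "Probability3": [0.81, 0.77, 0.48],
    "Mode4": ["cheque", "cheque", "wire"],
    "Probability4": [0.69, 0.58, 0.66],
})

mode_cols = [f"Mode{i}" for i in range(1, 5)]
prob_cols = [f"Probability{i}" for i in range(1, 5)]

modes = df[mode_cols].to_numpy()
probs = df[prob_cols].to_numpy(dtype=float)

# For each row, the positions of the 3 largest probabilities, highest first.
order = np.argsort(-probs, axis=1, kind="stable")[:, :3]

# Reorder modes and probabilities row by row using those positions.
top_modes = np.take_along_axis(modes, order, axis=1)
top_probs = np.take_along_axis(probs, order, axis=1)

for rank in range(3):
    df[f"Top{rank + 1}Mode"] = top_modes[:, rank]
    df[f"Top{rank + 1}Value"] = top_probs[:, rank]

print(df[["ID", "Top1Mode", "Top1Value", "Top2Mode", "Top2Value", "Top3Mode", "Top3Value"]])
```

With a stable sort, ties keep the original column order (Mode1 before Mode2, and so on); whether that is the desired tie-breaking is an open assumption.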