Optimizing equation parameter values to maximize the distance between groups

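For a specific gene scoring system, I would like to set up a rudimentary plot in which newly entered sample values immediately gravitate towards either the healthy or the unhealthy group, based on multiple gene measurements. Let's presume we have 5 people, each with 6 genes measured.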

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


df = pd.DataFrame(np.array([[A, 1, 1.2, 1.4, 2, 2], [B, 1.5, 1, 1.4, 1.3, 1.2], [C, 1, 1.2, 1.6, 2, 1.4], [D, 1.7, 1.5, 1.5, 1.5, 1.4], [E, 1.6, 1.9, 1.8, 3, 2.5], [F, 2, 2.2, 1.9, 2, 2]]), columns=['Gene', 'Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'])

This creates the following table:

Gene  Healthy 1  Healthy 2  Healthy 3  Unhealthy 1  Unhealthy 2
A     1.0        1.2        1.4        2.0          2.0
B     1.5        1.0        1.4        1.3          1.2
C     1.0        1.2        1.6        2.0          1.4
D     1.7        1.5        1.5        1.5          1.4
E     1.6        1.9        1.8        3.0          2.5
F     2.0        2.2        1.9        2.0          2.0

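The X and Y coordinates of each sample are then calculated by summing the contributions of the genes, where each contribution is the gene's parameter/weight multiplied by its measurement. The first 4 genes contribute to the Y value, whilst genes 5 and 6 determine the X value. wA - wF are the parameters/weights associated with their gene A-F counterparts.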
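In equation form, the description above amounts to the following, where A_j, ..., F_j denote the measurements of genes A to F in sample j (the subscript j is notation introduced here, not part of the original code):

```latex
\begin{aligned}
Y_j &= w_A A_j + w_B B_j + w_C C_j + w_D D_j \\
X_j &= w_E E_j + w_F F_j
\end{aligned}
```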

wA = .15 
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

n=0

for n in range (5):

    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]

    TrueY = wA*y1+wB*y2+wC*y3+wD*y4

    x1 = df.iat[4,n]
    x2 = df.iat[5,n]

    TrueX = (wE*x1+wF*x2)

    result = (TrueX, TrueY)

    n += 1

    label = f"({TrueX},{TrueY})"

    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')

So we calculate all the coordinates and plot them.

Plot

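What I would now like to do is figure out how to optimize the wA-wF parameters/weights so that the healthy samples are pushed towards the origin of the plot, say (0,0), whilst the unhealthy samples are pushed towards a reasonable opposite point, say (1,1). I've looked into K-means/SVM, but as a novice coder/biochemist I was thoroughly overwhelmed, and I would appreciate any help available.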

Here's an example using scipy.optimize combined with your code. (Since your code contains some syntax and type errors, I had to make small corrections.)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                            [1.5, 1, 1.4, 1.3, 1.2],
                            [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4],
                            [1.6, 1.9, 1.8, 3, 2.5],
                            [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=[['A', 'B', 'C', 'D', 'E', 'F']])

wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60

from scipy.optimize import minimize

# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])

# the objective function to be minimized
# - it computes the (square of) the samples' distances to (0,0) resp. (1,1)
def fun(w):
    weighted = df.values*w[:, None] # multiply all sample values by their weight
    y = sum(weighted[:4])           # compute all 5 "TrueY" coordinates
    x = sum(weighted[4:])           # compute all 5 "TrueX" coordinates
    y[3:] -= 1                      # adjust the "Unhealthy" y to the target (x,1)
    x[3:] -= 1                      # adjust the "Unhealthy" x to the target (1,y)
    return sum(x**2+y**2)           # return the sum of (squared) distances

res = minimize(fun, w0)
print(res)

# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x

# this is mostly your unchanged code
for n in range (5):

    y1 = df.iat[0,n]
    y2 = df.iat[1,n]
    y3 = df.iat[2,n]
    y4 = df.iat[3,n]

    TrueY = wA*y1+wB*y2+wC*y3+wD*y4

    x1 = df.iat[4,n]
    x2 = df.iat[5,n]

    TrueX = (wE*x1+wF*x2)

    result = (TrueX, TrueY)

    label = f"({TrueX:.3f},{TrueY:.3f})"

    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX,TrueY), textcoords="offset points", xytext=(0,10), ha='center')

plt.savefig("mygraph.png")

This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. In the resulting plot, the healthy samples cluster around (0,0) and the unhealthy samples around (1,1).
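Because the coordinates are linear in the weights, the sum of squared distances minimized above is an ordinary linear least-squares problem, so an iterative optimizer is not strictly required. The following is a minimal sketch of that alternative, assuming the same data and the same targets (0,0) and (1,1); the names values, target and w_ls are introduced here for illustration only. It should give weights close to the ones reported above, up to the optimizer's tolerance.

```python
import numpy as np

# Gene measurements copied from the table above: rows are genes A-F,
# columns are Healthy 1-3 followed by Unhealthy 1-2.
values = np.array([[1.0, 1.2, 1.4, 2.0, 2.0],
                   [1.5, 1.0, 1.4, 1.3, 1.2],
                   [1.0, 1.2, 1.6, 2.0, 1.4],
                   [1.7, 1.5, 1.5, 1.5, 1.4],
                   [1.6, 1.9, 1.8, 3.0, 2.5],
                   [2.0, 2.2, 1.9, 2.0, 2.0]])

# Target coordinate per sample: 0 for healthy, 1 for unhealthy.
# The same target vector applies to both the x and the y axis.
target = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Each sample contributes one linear equation per axis:
#   genes A-D (rows 0-3) determine y, genes E-F (rows 4-5) determine x.
w_y, *_ = np.linalg.lstsq(values[:4].T, target, rcond=None)
w_x, *_ = np.linalg.lstsq(values[4:].T, target, rcond=None)
w_ls = np.concatenate([w_y, w_x])  # ordered as wA, wB, wC, wD, wE, wF

# Coordinates of the five samples under the least-squares weights
y = values[:4].T @ w_y
x = values[4:].T @ w_x

print("weights:", np.round(w_ls, 4))
print("x:", np.round(x, 3))
print("y:", np.round(y, 3))
```

Splitting the problem per axis works here only because no gene contributes to both coordinates; if a gene fed into both X and Y, the two fits would have to be solved jointly.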

You may want to experiment with other optimization methods; see the documentation for scipy.optimize.minimize.
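As a rough illustration of what trying other methods could look like, the sketch below runs the same objective through a few of the solvers that scipy.optimize.minimize accepts via its method argument. The data array values and the dictionary results are names introduced here; the choice of methods is arbitrary and only meant as a starting point.

```python
import numpy as np
from scipy.optimize import minimize

# Same measurements as above: rows are genes A-F, columns are samples.
values = np.array([[1.0, 1.2, 1.4, 2.0, 2.0],
                   [1.5, 1.0, 1.4, 1.3, 1.2],
                   [1.0, 1.2, 1.6, 2.0, 1.4],
                   [1.7, 1.5, 1.5, 1.5, 1.4],
                   [1.6, 1.9, 1.8, 3.0, 2.5],
                   [2.0, 2.2, 1.9, 2.0, 2.0]])

# Initial guess: the weights wA..wF given in the question
w0 = np.array([.15, .25, .35, .45, .50, .60])

def fun(w):
    """Sum of squared distances to (0,0) for healthy, (1,1) for unhealthy."""
    weighted = values * w[:, None]   # scale each gene's row by its weight
    y = weighted[:4].sum(axis=0)     # genes A-D give the y coordinates
    x = weighted[4:].sum(axis=0)     # genes E-F give the x coordinates
    y[3:] -= 1                       # shift unhealthy samples by the target y=1
    x[3:] -= 1                       # shift unhealthy samples by the target x=1
    return np.sum(x**2 + y**2)

# Run several solvers from the same starting point and compare the objective
results = {}
for method in ("BFGS", "L-BFGS-B", "Nelder-Mead", "Powell"):
    results[method] = minimize(fun, w0, method=method)
    print(f"{method:12s} objective = {float(results[method].fun):.6f}")
```

Comparing the final objective values is a quick way to spot a solver that stopped early; derivative-free methods such as Nelder-Mead may need a larger iteration limit to match the gradient-based ones.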