Optimizing equation parameter values to maximize the distance between groups
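For a particular gene scoring system, I would like to set up a basic plot such that, based on multiple gene measurements, the values of a newly entered sample are immediately drawn toward either the healthy or the unhealthy group in the plot. Suppose we have 5 people, each with 6 genes measured.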
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Rows are genes A-F (kept in the index so the values stay numeric); columns are the five samples.
df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2], [1.5, 1, 1.4, 1.3, 1.2], [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4], [1.6, 1.9, 1.8, 3, 2.5], [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=pd.Index(['A', 'B', 'C', 'D', 'E', 'F'], name='Gene'))
This creates the following table:
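Gene | Healthy 1 | Healthy 2 | Healthy 3 | Unhealthy 1 | Unhealthy 2 |
---|---|---|---|---|---|
A | 1.0 | 1.2 | 1.4 | 2.0 | 2.0 |
B | 1.5 | 1.0 | 1.4 | 1.3 | 1.2 |
C | 1.0 | 1.2 | 1.6 | 2.0 | 1.4 |
D | 1.7 | 1.5 | 1.5 | 1.5 | 1.4 |
E | 1.6 | 1.9 | 1.8 | 3.0 | 2.5 |
F | 2.0 | 2.2 | 1.9 | 2.0 | 2.0 |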
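The X and Y coordinates of each sample are then computed by multiplying each gene's measurement by its parameter/weight and summing the contributions. The first four genes (A-D) contribute to the Y value, while genes 5 and 6 (E and F) determine the X value. wA through wF are the parameters/weights for genes A through F respectively.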
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
for n in range(5):
    # genes A-D (rows 0-3) contribute to Y
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    # genes E and F (rows 4-5) contribute to X
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = wE*x1 + wF*x2
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')
plt.show()
So we compute all the coordinates and plot them:
Plot
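As a compact illustration of the weighted-sum rule described above, the same coordinates can be computed with matrix products instead of a loop. This is only a sketch: the names `data`, `weights`, and `coordinates` are introduced here for illustration and are not part of the original code.

```python
import numpy as np

# Measurements from the table above: rows are genes A-F, columns are the
# samples Healthy 1-3 and Unhealthy 1-2.
data = np.array([[1.0, 1.2, 1.4, 2.0, 2.0],
                 [1.5, 1.0, 1.4, 1.3, 1.2],
                 [1.0, 1.2, 1.6, 2.0, 1.4],
                 [1.7, 1.5, 1.5, 1.5, 1.4],
                 [1.6, 1.9, 1.8, 3.0, 2.5],
                 [2.0, 2.2, 1.9, 2.0, 2.0]])

# Initial weights wA..wF from the question.
weights = np.array([0.15, 0.25, 0.35, 0.45, 0.50, 0.60])


def coordinates(data, weights):
    """Return (x, y) arrays with one entry per sample (column of `data`)."""
    y = weights[:4] @ data[:4]   # genes A-D contribute to Y
    x = weights[4:] @ data[4:]   # genes E and F contribute to X
    return x, y


x, y = coordinates(data, weights)
for xi, yi in zip(x, y):
    print(f"({xi:.3f}, {yi:.3f})")
```

Writing the rule this way makes it visible that each coordinate is a linear function of the weights, which matters when choosing an optimization method.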
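What I would now like to do is work out how to optimize the wA-wF parameters/weights so that the healthy samples are pushed toward the origin of the plot, say (0, 0), while the unhealthy samples are pushed toward a reasonable opposite point, say (1, 1). I have looked into K-means and SVM, but as a novice coder and biochemist I am completely overwhelmed, and I would greatly appreciate any help.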
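Here is an example that uses scipy.optimize together with your code. (I had to make a few small corrections, because the code as originally posted contained some syntax and type errors.)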
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame(np.array([[1, 1.2, 1.4, 2, 2],
                            [1.5, 1, 1.4, 1.3, 1.2],
                            [1, 1.2, 1.6, 2, 1.4],
                            [1.7, 1.5, 1.5, 1.5, 1.4],
                            [1.6, 1.9, 1.8, 3, 2.5],
                            [2, 2.2, 1.9, 2, 2]]),
                  columns=['Healthy 1', 'Healthy 2', 'Healthy 3', 'Unhealthy 1', 'Unhealthy 2'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
wA = .15
wB = .25
wC = .35
wD = .45
wE = .50
wF = .60
from scipy.optimize import minimize
# use your given weights as the initial guess
w0 = np.array([wA, wB, wC, wD, wE, wF])
# the objective function to be minimized
# - it computes the sum of squared distances of the samples to (0,0) or (1,1)
def fun(w):
    weighted = df.values*w[:, None]  # multiply each gene's row by its weight
    y = sum(weighted[:4])            # compute all 5 "TrueY" coordinates
    x = sum(weighted[4:])            # compute all 5 "TrueX" coordinates
    y[3:] -= 1                       # shift the "Unhealthy" y so the target 1 maps to 0
    x[3:] -= 1                       # shift the "Unhealthy" x so the target 1 maps to 0
    return sum(x**2+y**2)            # return the sum of (squared) distances
res = minimize(fun, w0)
print(res)
# assign the optimized weights back to your parameters
wA, wB, wC, wD, wE, wF = res.x
# this is mostly your unchanged code
for n in range(5):
    y1 = df.iat[0, n]
    y2 = df.iat[1, n]
    y3 = df.iat[2, n]
    y4 = df.iat[3, n]
    TrueY = wA*y1 + wB*y2 + wC*y3 + wD*y4
    x1 = df.iat[4, n]
    x2 = df.iat[5, n]
    TrueX = wE*x1 + wF*x2
    result = (TrueX, TrueY)
    label = f"({TrueX:.3f},{TrueY:.3f})"
    plt.scatter(TrueX, TrueY, alpha=0.5)
    plt.annotate(label, (TrueX, TrueY), textcoords="offset points", xytext=(0, 10), ha='center')
plt.savefig("mygraph.png")
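This yields the parameters [ 1.21773653, 0.22185886, -0.39377451, -0.76513658, 0.86984207, -0.73166533] as the solution array. With these weights, the healthy samples cluster around (0,0) and the unhealthy samples cluster around (1,1).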
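To connect this back to the goal of placing a newly entered sample, here is a minimal sketch that projects a sample with the optimized weights and assigns it to whichever target point is closer. The nearest-target rule, the helper names `project` and `classify`, and the values in `new_sample` are assumptions made for illustration; they are not part of the original code, and with only five training samples the result should not be read as a validated classifier.

```python
import numpy as np

# Optimized weights wA..wF, copied from the solution array reported above.
w_opt = np.array([1.21773653, 0.22185886, -0.39377451,
                  -0.76513658, 0.86984207, -0.73166533])

# Target points used in the objective: healthy -> (0, 0), unhealthy -> (1, 1).
TARGETS = {"Healthy": np.array([0.0, 0.0]),
           "Unhealthy": np.array([1.0, 1.0])}


def project(sample, w):
    """Map six gene measurements (A-F) to the (x, y) point in the plot."""
    sample = np.asarray(sample, dtype=float)
    y = w[:4] @ sample[:4]   # genes A-D contribute to Y
    x = w[4:] @ sample[4:]   # genes E and F contribute to X
    return np.array([x, y])


def classify(sample, w=w_opt):
    """Return the label of the target point closest to the projected sample."""
    point = project(sample, w)
    return min(TARGETS, key=lambda name: np.linalg.norm(point - TARGETS[name]))


# Hypothetical measurements for genes A-F of a new sample (made-up values).
new_sample = [1.1, 1.4, 1.2, 1.6, 1.7, 2.0]
print(project(new_sample, w_opt), classify(new_sample))
```

A distance threshold or a proper classifier trained on more samples would be a more cautious choice once more data is available.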
You may want to try other optimization methods; see the documentation for scipy.optimize.minimize.
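As a sketch of what trying other methods could look like, the snippet below runs the same objective through several solvers via the `method` argument of `minimize`. Because both coordinates are linear in the weights, the objective is also an ordinary linear least-squares problem, so the snippet adds a closed-form comparison with `np.linalg.lstsq`; that is a swapped-in technique, not something the answer above uses. The names `data`, `target`, and `objective` are introduced here for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Gene measurements: rows are genes A-F, columns are the five samples.
data = np.array([[1.0, 1.2, 1.4, 2.0, 2.0],
                 [1.5, 1.0, 1.4, 1.3, 1.2],
                 [1.0, 1.2, 1.6, 2.0, 1.4],
                 [1.7, 1.5, 1.5, 1.5, 1.4],
                 [1.6, 1.9, 1.8, 3.0, 2.5],
                 [2.0, 2.2, 1.9, 2.0, 2.0]])

# Target value of both coordinates per sample: 0 for healthy, 1 for unhealthy.
target = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Initial guess: the weights from the question.
w0 = np.array([0.15, 0.25, 0.35, 0.45, 0.50, 0.60])


def objective(w):
    """Sum of squared distances of all samples to their target point."""
    y = w[:4] @ data[:4]
    x = w[4:] @ data[4:]
    return np.sum((x - target) ** 2 + (y - target) ** 2)


# Same objective, different solvers selected through `method`.
results = {m: minimize(objective, w0, method=m)
           for m in ("BFGS", "Nelder-Mead", "Powell")}
for name, res in results.items():
    print(f"{name:12s} objective={res.fun:.6f} weights={np.round(res.x, 4)}")

# Closed-form alternative: two independent least-squares fits,
# one for the Y weights (genes A-D) and one for the X weights (genes E-F).
w_y, *_ = np.linalg.lstsq(data[:4].T, target, rcond=None)
w_x, *_ = np.linalg.lstsq(data[4:].T, target, rcond=None)
w_lstsq = np.concatenate([w_y, w_x])
print(f"{'lstsq':12s} objective={objective(w_lstsq):.6f} "
      f"weights={np.round(w_lstsq, 4)}")
```

The least-squares weights should land close to the solution array reported above. Derivative-free methods such as Nelder-Mead may stop before reaching the same objective value, which is one reason to compare several solvers.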