幂律拟合的 r2 值问题:这个拟合真的比直线差吗?
Problem with r2 value of power law fit: Is this fit really worse than a straight line?
我有一个包含 x 和 y 数据的数据框。数据表现出一般的幂律形式,所以我想拟合一条幂律曲线。然而,当我这样做时,我得到一个 -2.63 的 r2 值,这告诉我拟合比直线更差。真的是这样吗?
我使用两种方法(r_squared
和 r_squared_2
)计算了 r2。我不确定问题是出在我如何拟合幂律曲线还是出在我的 r2 计算上。
import numpy as np
import matplotlib.pyplot as plt
from math import exp
import pandas as pd
from scipy.optimize import curve_fit
new_dat = pd.DataFrame( {'x': {0: 133.39072904912717, 1: 138.24394099399626, 2: 123.68098616912548, 3: 145.53622634264102, 4: 115.66589965763, 5: 102.3954554988245, 6: 188.36808402142134, 7: 159.973167750876, 8: 109.27288573117258, 9: 228.48970913145482, 10: 59.72149772767079, 11: 107.86289489093632, 12: 139.54636990829943, 13: 69.16804084300782, 14: 128.51492415467243, 15: 123.89886969748194, 16: 96.69524458890069, 17: 179.4218007796204, 18: 81.94248858920511, 19: 116.57431987139303, 20: 86.72287597716091, 21: 104.26504167186982, 22: 96.21176975617014, 23: 113.05563002252855, 24: 95.13881793216328, 25: 90.24566440833108, 26: 120.21979837370618, 27: 148.02989788213065, 28: 131.5536333505709, 29: 43.98432257846345, 30: 151.20808505875556, 31: 106.90408749635041, 32: 208.84439653547977, 33: 141.93620845530992, 34: 66.06470015823503, 35: 144.26451450341665, 36: 268.44231416009114, 37: 104.21592558477657, 38: 87.4647314243362, 39: 21.62506288172477, 40: 211.8288449343543, 41: 137.1783782430448, 42: 152.68656578316114, 43: 71.40444647539057, 44: 138.26429570303063, 45: 195.8195134445166, 46: 65.0580543537033, 47: 91.53609270183331, 48: 133.93031838426649, 49: 130.18323679275105}, 'y': {0: 3385.9107941013963, 1: 3767.4773129933837, 2: 3393.972804533385, 3: 3207.540799419189, 4: 3503.971988612639, 5: 2699.582157811891, 6: 3472.2197615815303, 7: 3734.8682981154525, 8: 3015.1595391710443, 9: 3833.7005103694264, 10: 2180.8813084622725, 11: 3057.1175212715566, 12: 3322.2694622283707, 13: 2625.095843511092, 14: 3428.9305902296073, 15: 3665.3597140080483, 16: 3348.2359174389712, 17: 3203.650823344419, 18: 2314.649577797234, 19: 3445.142411753858, 20: 2854.0698989716257, 21: 3224.426663700497, 22: 2975.4529990214733, 23: 2830.8849349346683, 24: 2757.178895276296, 25: 2804.842233145504, 26: 2580.295378480375, 27: 3451.028240314123, 28: 3559.855598644374, 29: 1682.632983470442, 30: 3573.6640120241777, 31: 2612.5922620115434, 32: 3047.6869797329296, 33: 3500.1611748529945, 34: 2976.7358839883286, 35: 3270.15016432246, 36: 3702.39276797799, 37: 3174.024034000559, 38: 3116.0557991571313, 39: 2277.6763663475185, 40: 2907.349510347204, 41: 2959.5225286559644, 42: 3523.744356032963, 43: 2793.4503330781213, 44: 3688.929929188237, 45: 3654.619681532315, 46: 2077.6749192493166, 47: 2692.8596392079253, 48: 3365.5547117446195, 49: 3357.554384166426}} )
x,y = new_dat.x, new_dat.y
def power_law(x, a, b):
return a*np.power(x, b)
# find best fit
popt, pcov = curve_fit(power_law, x, y)
# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3, linewidth=2.5)
# get r2 via method 1
residuals = y - power_law(x, *popt)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y-np.mean(y))**2)
r_squared = 1 - (ss_res / ss_tot)
# get r2 via method 2
from sklearn.metrics import r2_score
r_squared_2 = r2_score(y, power_law(x, *popt), multioutput='variance_weighted')
# plot labels
x_lab, y_lab, title = 'X data', 'Y data', 'Power law fit'
plt.xlabel(x_lab)
plt.ylabel(y_lab)
plt.title(title, fontweight="bold")
#ss_res / ss_tot
#print(ss_res, ss_tot)
#print(popt[0],popt[1])
# print both r2 methods
print('r2 from method 1 is:',r_squared)
print('r2 from methood 2 is:',r_squared_2)
p.s。我知道数据看起来是线性的,但我希望能够拟合幂律分布,因为实际数据点的数量远大于 len(x)。
因为您的数据对于较小的 x 值是线性的,我建议拟合 y^2as something like
y^2 = a x + b, or, if you prefer, y = (ax+b)^c (I am positing
1/2`,但这可能不是是正确的)。
看看这些行:
popt, pcov = curve_fit(power_law, x, y)
# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3, linewidth=2.5)
首先,您将自己的观点与原始 x
数据相匹配。不过在那之后,x
被 np.linspace
覆盖了。检查文档:该方法会更改您的原始值!
尝试将这些行更改为:
popt, pcov = curve_fit(power_law, x, y)
# plot data and best fit curve.
plt.plot(x, y,'ok')
plt.plot(np.sort(x), power_law(np.sort(x), *popt),'r-',markersize=3, linewidth=2.5)
您会看到 r2 增加到 0.54。如果您认为它仍然太低,您可能想尝试使用更多数据点(如果有的话),或者如已经建议的那样,更改曲线以适合。
P.S:我使用 np.sort() 来正确绘制拟合曲线。如果不使用,曲线会弄乱情节(自己尝试 :D)。
我有一个包含 x 和 y 数据的数据框。数据表现出一般的幂律形式,所以我想拟合一条幂律曲线。然而,当我这样做时,我得到一个 -2.63 的 r2 值,这告诉我拟合比直线更差。真的是这样吗?
我使用两种方法(r_squared
和 r_squared_2
)计算了 r2。我不确定问题是出在我如何拟合幂律曲线还是出在我的 r2 计算上。
import numpy as np
import matplotlib.pyplot as plt
from math import exp
import pandas as pd
from scipy.optimize import curve_fit
new_dat = pd.DataFrame( {'x': {0: 133.39072904912717, 1: 138.24394099399626, 2: 123.68098616912548, 3: 145.53622634264102, 4: 115.66589965763, 5: 102.3954554988245, 6: 188.36808402142134, 7: 159.973167750876, 8: 109.27288573117258, 9: 228.48970913145482, 10: 59.72149772767079, 11: 107.86289489093632, 12: 139.54636990829943, 13: 69.16804084300782, 14: 128.51492415467243, 15: 123.89886969748194, 16: 96.69524458890069, 17: 179.4218007796204, 18: 81.94248858920511, 19: 116.57431987139303, 20: 86.72287597716091, 21: 104.26504167186982, 22: 96.21176975617014, 23: 113.05563002252855, 24: 95.13881793216328, 25: 90.24566440833108, 26: 120.21979837370618, 27: 148.02989788213065, 28: 131.5536333505709, 29: 43.98432257846345, 30: 151.20808505875556, 31: 106.90408749635041, 32: 208.84439653547977, 33: 141.93620845530992, 34: 66.06470015823503, 35: 144.26451450341665, 36: 268.44231416009114, 37: 104.21592558477657, 38: 87.4647314243362, 39: 21.62506288172477, 40: 211.8288449343543, 41: 137.1783782430448, 42: 152.68656578316114, 43: 71.40444647539057, 44: 138.26429570303063, 45: 195.8195134445166, 46: 65.0580543537033, 47: 91.53609270183331, 48: 133.93031838426649, 49: 130.18323679275105}, 'y': {0: 3385.9107941013963, 1: 3767.4773129933837, 2: 3393.972804533385, 3: 3207.540799419189, 4: 3503.971988612639, 5: 2699.582157811891, 6: 3472.2197615815303, 7: 3734.8682981154525, 8: 3015.1595391710443, 9: 3833.7005103694264, 10: 2180.8813084622725, 11: 3057.1175212715566, 12: 3322.2694622283707, 13: 2625.095843511092, 14: 3428.9305902296073, 15: 3665.3597140080483, 16: 3348.2359174389712, 17: 3203.650823344419, 18: 2314.649577797234, 19: 3445.142411753858, 20: 2854.0698989716257, 21: 3224.426663700497, 22: 2975.4529990214733, 23: 2830.8849349346683, 24: 2757.178895276296, 25: 2804.842233145504, 26: 2580.295378480375, 27: 3451.028240314123, 28: 3559.855598644374, 29: 1682.632983470442, 30: 3573.6640120241777, 31: 2612.5922620115434, 32: 3047.6869797329296, 33: 3500.1611748529945, 34: 2976.7358839883286, 35: 3270.15016432246, 36: 3702.39276797799, 37: 3174.024034000559, 38: 3116.0557991571313, 39: 2277.6763663475185, 40: 2907.349510347204, 41: 2959.5225286559644, 42: 3523.744356032963, 43: 2793.4503330781213, 44: 3688.929929188237, 45: 3654.619681532315, 46: 2077.6749192493166, 47: 2692.8596392079253, 48: 3365.5547117446195, 49: 3357.554384166426}} )
x,y = new_dat.x, new_dat.y
def power_law(x, a, b):
return a*np.power(x, b)
# find best fit
popt, pcov = curve_fit(power_law, x, y)
# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3, linewidth=2.5)
# get r2 via method 1
residuals = y - power_law(x, *popt)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y-np.mean(y))**2)
r_squared = 1 - (ss_res / ss_tot)
# get r2 via method 2
from sklearn.metrics import r2_score
r_squared_2 = r2_score(y, power_law(x, *popt), multioutput='variance_weighted')
# plot labels
x_lab, y_lab, title = 'X data', 'Y data', 'Power law fit'
plt.xlabel(x_lab)
plt.ylabel(y_lab)
plt.title(title, fontweight="bold")
#ss_res / ss_tot
#print(ss_res, ss_tot)
#print(popt[0],popt[1])
# print both r2 methods
print('r2 from method 1 is:',r_squared)
print('r2 from methood 2 is:',r_squared_2)
p.s。我知道数据看起来是线性的,但我希望能够拟合幂律分布,因为实际数据点的数量远大于 len(x)。
因为您的数据对于较小的 x 值是线性的,我建议拟合 y^2as something like
y^2 = a x + b, or, if you prefer, y = (ax+b)^c (I am positing
1/2`,但这可能不是是正确的)。
看看这些行:
popt, pcov = curve_fit(power_law, x, y)
# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3, linewidth=2.5)
首先,您将自己的观点与原始 x
数据相匹配。不过在那之后,x
被 np.linspace
覆盖了。检查文档:该方法会更改您的原始值!
尝试将这些行更改为:
popt, pcov = curve_fit(power_law, x, y)
# plot data and best fit curve.
plt.plot(x, y,'ok')
plt.plot(np.sort(x), power_law(np.sort(x), *popt),'r-',markersize=3, linewidth=2.5)
您会看到 r2 增加到 0.54。如果您认为它仍然太低,您可能想尝试使用更多数据点(如果有的话),或者如已经建议的那样,更改曲线以适合。
P.S:我使用 np.sort() 来正确绘制拟合曲线。如果不使用,曲线会弄乱情节(自己尝试 :D)。