Problem with r2 value of power law fit: Is this fit really worse than a straight line?

Question

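I have a data frame containing x and y data. The data roughly follow a power-law form, so I want to fit a power-law curve. When I do, however, I get an r2 value of -2.63, which tells me the fit is worse than a straight line. Is that really the case?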

I calculated r2 using two methods (r_squared and r_squared_2). I am not sure whether the problem lies in how I fit the power-law curve or in my r2 calculation.

import numpy as np
import matplotlib.pyplot as plt
from math import exp
import pandas as pd
from scipy.optimize import curve_fit

new_dat = pd.DataFrame( {'x': {0: 133.39072904912717, 1: 138.24394099399626, 2: 123.68098616912548, 3: 145.53622634264102, 4: 115.66589965763, 5: 102.3954554988245, 6: 188.36808402142134, 7: 159.973167750876, 8: 109.27288573117258, 9: 228.48970913145482, 10: 59.72149772767079, 11: 107.86289489093632, 12: 139.54636990829943, 13: 69.16804084300782, 14: 128.51492415467243, 15: 123.89886969748194, 16: 96.69524458890069, 17: 179.4218007796204, 18: 81.94248858920511, 19: 116.57431987139303, 20: 86.72287597716091, 21: 104.26504167186982, 22: 96.21176975617014, 23: 113.05563002252855, 24: 95.13881793216328, 25: 90.24566440833108, 26: 120.21979837370618, 27: 148.02989788213065, 28: 131.5536333505709, 29: 43.98432257846345, 30: 151.20808505875556, 31: 106.90408749635041, 32: 208.84439653547977, 33: 141.93620845530992, 34: 66.06470015823503, 35: 144.26451450341665, 36: 268.44231416009114, 37: 104.21592558477657, 38: 87.4647314243362, 39: 21.62506288172477, 40: 211.8288449343543, 41: 137.1783782430448, 42: 152.68656578316114, 43: 71.40444647539057, 44: 138.26429570303063, 45: 195.8195134445166, 46: 65.0580543537033, 47: 91.53609270183331, 48: 133.93031838426649, 49: 130.18323679275105}, 'y': {0: 3385.9107941013963, 1: 3767.4773129933837, 2: 3393.972804533385, 3: 3207.540799419189, 4: 3503.971988612639, 5: 2699.582157811891, 6: 3472.2197615815303, 7: 3734.8682981154525, 8: 3015.1595391710443, 9: 3833.7005103694264, 10: 2180.8813084622725, 11: 3057.1175212715566, 12: 3322.2694622283707, 13: 2625.095843511092, 14: 3428.9305902296073, 15: 3665.3597140080483, 16: 3348.2359174389712, 17: 3203.650823344419, 18: 2314.649577797234, 19: 3445.142411753858, 20: 2854.0698989716257, 21: 3224.426663700497, 22: 2975.4529990214733, 23: 2830.8849349346683, 24: 2757.178895276296, 25: 2804.842233145504, 26: 2580.295378480375, 27: 3451.028240314123, 28: 3559.855598644374, 29: 1682.632983470442, 30: 3573.6640120241777, 31: 2612.5922620115434, 32: 3047.6869797329296, 33: 3500.1611748529945, 34: 
2976.7358839883286, 35: 3270.15016432246, 36: 3702.39276797799, 37: 3174.024034000559, 38: 3116.0557991571313, 39: 2277.6763663475185, 40: 2907.349510347204, 41: 2959.5225286559644, 42: 3523.744356032963, 43: 2793.4503330781213, 44: 3688.929929188237, 45: 3654.619681532315, 46: 2077.6749192493166, 47: 2692.8596392079253, 48: 3365.5547117446195, 49: 3357.554384166426}} )
x,y = new_dat.x, new_dat.y

def power_law(x, a, b):
    return a*np.power(x, b)

# find best fit
popt, pcov = curve_fit(power_law, x, y)

# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3,  linewidth=2.5)

# get r2 via method 1
residuals = y - power_law(x, *popt)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y-np.mean(y))**2)
r_squared = 1 - (ss_res / ss_tot)

# get r2 via method 2
from sklearn.metrics import r2_score
r_squared_2 = r2_score(y, power_law(x, *popt), multioutput='variance_weighted')

# plot labels
x_lab, y_lab, title = 'X data', 'Y data', 'Power law fit'
plt.xlabel(x_lab)
plt.ylabel(y_lab)
plt.title(title, fontweight="bold")

#ss_res / ss_tot
#print(ss_res, ss_tot)
#print(popt[0],popt[1])

# print both r2 methods
print('r2 from method 1 is:',r_squared)
print('r2 from method 2 is:',r_squared_2)

P.S. I know the data look linear, but I want to be able to fit a power law, because the actual number of data points is much larger than len(x).

Answer 1

Because your data is linear for smaller values of x, I suggest fitting y^2 as something like y^2 = a x + b, or, if you prefer, y = (ax+b)^c (I am positing c = 1/2, but this might not be correct).
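A minimal sketch of the y^2 = a x + b idea, in case it helps. The data here are synthetic stand-ins generated with a fixed seed (an assumption, not the data from the question), and the variable names are only illustrative. Squaring y turns the problem into an ordinary straight-line fit, which np.polyfit can handle; predictions in the original units are then the square root of the fitted line.

```python
import numpy as np

# Synthetic stand-in data (assumption: NOT the data from the question).
rng = np.random.default_rng(0)
x = rng.uniform(20, 270, size=50)
y = np.sqrt(60000.0 * x + 2.0e6) * (1 + rng.normal(0, 0.03, size=50))

# Straight-line least squares on the squared response: y**2 = a*x + b.
# np.polyfit returns coefficients from highest degree down: [slope, intercept].
a_lin, b_lin = np.polyfit(x, y**2, 1)

# Back-transform to the original units: y_hat = sqrt(a*x + b).
# The clip guards against a negative argument under the square root.
y_hat = np.sqrt(np.clip(a_lin * x + b_lin, 0.0, None))

# r2 in the original units, evaluated at the same x values used for the fit.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot

print('a =', a_lin, 'b =', b_lin)
print('r2 of sqrt(a*x + b) fit:', r_squared)
```

Note that this minimizes the squared error in y^2 rather than in y, so the result can differ somewhat from a direct nonlinear fit of y = (ax+b)^c. It could also serve as a starting guess for such a fit.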

Answer 2

Look at these lines:

popt, pcov = curve_fit(power_law, x, y)

# plot data and best fit curve.
plt.plot(x, y,'ok')
x = np.linspace(5, 300, len(x))
plt.plot(x, power_law(x, *popt),'r-',markersize=3,  linewidth=2.5)

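First, you fit your points against the original x data. After that, however, x is overwritten by the result of np.linspace. Check the documentation: np.linspace returns a new array of evenly spaced values, so from that line on x no longer holds your original data, and both r2 calculations compare y against predictions made at the wrong x values.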

Try changing these lines to:

popt, pcov = curve_fit(power_law, x, y)

# plot data and best fit curve.
plt.plot(x, y,'ok')
plt.plot(np.sort(x), power_law(np.sort(x), *popt),'r-',markersize=3,  linewidth=2.5)

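You will see r2 increase to 0.54. If you think that is still too low, you may want to try with more data points (if you have them) or, as already suggested, change the curve you are fitting.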

P.S.: I used np.sort() to plot the fitted curve correctly. Without it, the line jumps back and forth across the plot (try it yourself :D).
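To make the difference concrete, here is a rough sketch that keeps the data and the plotting grid in separate variables and computes r2 both ways. It runs on synthetic stand-in data with a fixed seed (an assumption, not the data from the question), so it will not reproduce the 0.54 above. The names x_grid, r2_right and r2_wrong are mine, and the log-log starting guess for curve_fit is my own addition for robustness, not something the original code does.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * np.power(x, b)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic stand-in data (assumption: NOT the data from the question).
rng = np.random.default_rng(1)
x = rng.uniform(20, 270, size=50)
y = 600.0 * np.power(x, 0.35) * (1 + rng.normal(0, 0.05, size=50))

# Starting guess from a straight-line fit in log-log space (my addition).
b0, log_a0 = np.polyfit(np.log(x), np.log(y), 1)
popt, pcov = curve_fit(power_law, x, y, p0=[np.exp(log_a0), b0])

# Correct: predictions at the same x values that produced y.
r2_right = r2(y, power_law(x, *popt))

# The mistake from the question: predictions on an unrelated, evenly spaced
# grid that merely happens to have the same length as y.
x_overwritten = np.linspace(5, 300, len(x))
r2_wrong = r2(y, power_law(x_overwritten, *popt))

# For plotting only: a separate grid, so x itself is never overwritten.
x_grid = np.linspace(x.min(), x.max(), 200)
y_grid = power_law(x_grid, *popt)

print('r2 at the original x:', r2_right)
print('r2 at the overwritten x:', r2_wrong)
```

Since the grid values are paired with y purely by position, the second number says nothing about the quality of the fit. Plotting x_grid against y_grid also removes the need for np.sort(), because the grid is already in increasing order.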

Tags: python, numpy, scipy, power-law, pandas