Using groupby on a function
I have code that computes the Theil-Sen slope and intercept for x and y variables, and I want to run it on a list of values, group by group. My file looks like this:
station_id year Sum
210018 1917 329.946
210018 1918 442.214
210018 1919 562.864
210018 1920 396.748
210018 1921 604.266
210019 1917 400.946
210019 1918 442.214
210019 1919 600.864
210019 1920 250.748
210019 1921 100.266
My output should be:
210018: -117189.92, 61.29
210019: 164382, -85.45
The code I am using is:
import numpy

def theil_sen(x,y):
    n = len(x)
    ord = numpy.argsort(x)
    xs = x[ord]
    ys = y[ord]
    vec1 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec1[ii,jj] = ys[ii]-ys[jj]
    vec2 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec2[ii,jj] = xs[ii]-xs[jj]
    v1 = vec1[vec2>0]
    v2 = vec2[vec2>0]
    slope = numpy.median( v1/v2 )
    coef = numpy.zeros( (2,1) )
    b_0 = numpy.median(y)-slope*numpy.median(x)
    b_1 = slope
    res = y-b_1*x-b_0 # residuals
    return (b_0,b_1,res)

stat=df.groupby(['station_id']).apply(lambda x: theil_sen(x['year'], x['Sum']))
print(stat)
So year is my x variable and Sum is my y variable. The code runs correctly for station 210018, but for 210019 it returns nan. Any help would be appreciated.
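numpy.argsort(x) does not play well with a pandas Series. After the first group it no longer works as expected, because the group's index is no longer 0 to n-1. Work on the NumPy arrays of x and y instead.
This works:
import numpy

def theil_sen(x,y):
    # Convert the pandas Series to plain NumPy arrays so that
    # indexing below is positional rather than label-based.
    x = x.values
    y = y.values
    n = len(x)
    ord = numpy.argsort(x)
    xs = x[ord]
    ys = y[ord]
    # Pairwise differences of y.
    vec1 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec1[ii,jj] = ys[ii]-ys[jj]
    # Pairwise differences of x.
    vec2 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec2[ii,jj] = xs[ii]-xs[jj]
    # Keep each pair once (positive x difference only).
    v1 = vec1[vec2>0]
    v2 = vec2[vec2>0]
    slope = numpy.median( v1/v2 )
    coef = numpy.zeros( (2,1) )
    b_0 = numpy.median(y)-slope*numpy.median(x)
    b_1 = slope
    res = y-b_1*x-b_0 # residuals
    return (b_0,b_1,res)

stat=df.groupby(['station_id']).apply(lambda x: theil_sen(x['year'], x['Sum']))
print(stat)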
station_id
210018 (-117189.927333, 61.2986666667, [10.3293333333...
210019 (164382.3745, -85.4515, [-170.903, -44.1835, 1...
dtype: object
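As a side note, the two nested loops can be replaced by NumPy broadcasting. The following is only a sketch, not the code from the question: the name `theil_sen_fit` is made up for illustration, it returns just the intercept and slope (the two numbers the question asks for) and drops the residuals, and it rebuilds the sample data inline. No sort is needed here, because the `dx > 0` mask already keeps each pair of points exactly once.

```python
import numpy as np
import pandas as pd


def theil_sen_fit(x, y):
    """Return (intercept, slope) of a Theil-Sen fit (hypothetical helper)."""
    # np.asarray drops any pandas index, so everything below is positional.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise differences via broadcasting: entry [i, j] is x[i] - x[j].
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    # Keep each pair once: only those with a positive x difference.
    mask = dx > 0
    slope = np.median(dy[mask] / dx[mask])
    intercept = np.median(y) - slope * np.median(x)
    return intercept, slope


# Sample data copied from the question.
df = pd.DataFrame({
    "station_id": [210018] * 5 + [210019] * 5,
    "year": [1917, 1918, 1919, 1920, 1921] * 2,
    "Sum": [329.946, 442.214, 562.864, 396.748, 604.266,
            400.946, 442.214, 600.864, 250.748, 100.266],
})

# Iterate over the groups directly and collect one (intercept, slope) per station.
fits = {station: theil_sen_fit(group["year"], group["Sum"])
        for station, group in df.groupby("station_id")}

for station, (intercept, slope) in fits.items():
    print(f"{station}: {intercept:.2f}, {slope:.2f}")
```

Iterating over the groups instead of calling apply is a design choice made here to get a plain dict of tuples back; apply with the same helper should work as well.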
The only additions to the existing function are these two lines:
x = x.values
y = y.values
Now let's look at what goes wrong when you apply np.argsort() to a Series object from any group after the first. Take the second group's values:
In [70]: x
Out[70]:
5 1917
6 1918
7 1919
8 1920
9 1921
Name: year, dtype: int64
In [71]: numpy.argsort(x)
Out[71]:
5 0
6 1
7 2
8 3
9 4
Name: year, dtype: int64
In [72]: x[numpy.argsort(x)]
Out[72]:
year
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: year, dtype: float64
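Because ord always takes values from 0 to n-1, x[ord] looks those positions up as index labels, which the later groups do not have, so it returns NaN values for them.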
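To make the label-versus-position distinction concrete, here is a small sketch. The Series is made up for illustration: it reuses the second group's index labels 5 to 9, but with the years shuffled so that the sort order is visible. It assumes a pandas version that provides `Series.to_numpy()`; on older versions `.values` plays the same role.

```python
import numpy as np
import pandas as pd

# Index labels 5..9, as in the second group, with shuffled years.
x = pd.Series([1921, 1917, 1919, 1918, 1920],
              index=[5, 6, 7, 8, 9], name="year")

# argsort on the underlying array yields positions in the range 0..n-1.
order = np.argsort(x.to_numpy())

# Option 1: index the NumPy array, which is always positional.
sorted_values = x.to_numpy()[order]

# Option 2: stay in pandas but index explicitly by position with .iloc.
sorted_by_position = x.iloc[order]

print(order)
print(sorted_values)
print(sorted_by_position)
```

Either option avoids treating the positions in `order` as index labels, which is what plain `x[order]` does on a Series with an integer index.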