Gini Coefficient in Julia: Efficient and Accurate Code
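I'm trying to implement the following formula in Julia to calculate the Gini coefficient of a wage distribution:
$$G = 1 - \frac{\sum_{i=1}^{n} f(y_i)\,(S_{i-1} + S_i)}{S_n}$$
where
$$S_i = \sum_{j=1}^{i} f(y_j)\,y_j, \qquad S_0 = 0.$$
Here's a simplified version of the code I'm using for this: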
# Takes an array where first column is value of wages
# (y_i in formula), and second column is probability
# of wage value (f(y_i) in formula).
function gini(wagedistarray)
    # First calculate S values in formula
    for i in 1:length(wagedistarray[:,1])
        for j in 1:i
            Swages[i]+=wagedistarray[j,2]*wagedistarray[j,1]
        end
    end
    # Now calculate value to subtract from 1 in gini formula
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:length(Swages)
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end
    # Final step of gini calculation
    return giniwages=1-(Gwages/Swages[length(Swages)])
end
wagedistarray=zeros(10000,2)
Swages=zeros(length(wagedistarray[:,1]))
for i in 1:length(wagedistarray[:,1])
    wagedistarray[i,1]=1
    wagedistarray[i,2]=1/10000
end
@time result=gini(wagedistarray)
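It gives a value close to zero, which is what you'd expect for a perfectly equal wage distribution. However, it takes quite a long time: 6.796 seconds.
Any ideas for improvement?
Try this: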
function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end
    Gwages=Swages[1]*wagedistarray[1,2]
    for i in 2:nrows
        Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end
    return 1-(Gwages/Swages[length(Swages)])
end
wagedistarray=zeros(10000,2)
for i in 1:size(wagedistarray,1)
    wagedistarray[i,1]=1
    wagedistarray[i,2]=1/10000
end
@time result=gini(wagedistarray)
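- Time before:
5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
- Time after:
0.134799301 seconds (507260 bytes allocated)
- Time after (second run):
elapsed time: 0.123665107 seconds (80112 bytes allocated)
The main problem is that Swages was a global variable (it didn't live inside the function), which is not good coding practice but, more importantly, is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks for its length; that generates some extra "garbage". The second run is faster because the first run of a function includes some compilation time.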
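To make the global-variable point concrete, here is a minimal sketch, assuming a recent Julia 1.x. The names acc_global, fill_global! and fill_local are made up for this illustration and are not part of the code above; actual timings will depend on the machine and Julia version.

```julia
# Hypothetical illustration: the same loop, once writing to a non-const
# global array and once to an array created inside the function.
acc_global = zeros(1000)       # non-const global: the compiler cannot assume its type

function fill_global!(x)
    for i in 1:length(x)
        acc_global[i] += x[i]  # each access looks the global up dynamically
    end
    return acc_global
end

function fill_local(x)
    acc = zeros(length(x))     # local array with a concrete, known type
    for i in 1:length(x)
        acc[i] += x[i]
    end
    return acc
end

x = ones(1000)
fill_global!(x); fill_local(x) # run once first so compilation is not timed
@time fill_global!(x)
@time fill_local(x)

# size(A, 1) returns the row count without copying a column,
# whereas length(A[:, 1]) first allocates a copy of the column.
A = zeros(10, 2)
println(size(A, 1) == length(A[:, 1]))
```

If the array really has to live outside the function, passing it in as an argument (or declaring the global const) should give the compiler the same type information as the local version.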
You can crank the performance up even higher by using @inbounds, i.e.
function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    @inbounds for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end
    Gwages=Swages[1]*wagedistarray[1,2]
    @inbounds for i in 2:nrows
        Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end
    return 1-(Gwages/Swages[length(Swages)])
end
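which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)
Finally, check out this version, which is actually faster than all the others and is also, I think, the most accurate: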
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
                sum(wagedistarray[2:end,2] .*
                        (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end
which has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit is going from the O(n^2) double for loop to the O(n) cumsum.
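As a small sanity check of that O(n^2) to O(n) change, the sketch below compares the nested-loop partial sums with cumsum on a tiny input. The helper names partial_sums_quadratic and partial_sums_linear and the sample vectors y and f are invented for this illustration.

```julia
# y holds wage values, f their probabilities (hypothetical sample data).
function partial_sums_quadratic(y, f)
    S = zeros(length(y))
    for i in 1:length(y)
        for j in 1:i           # re-adds the whole prefix for every i: O(n^2) work
            S[i] += f[j] * y[j]
        end
    end
    return S
end

# cumsum builds every prefix sum in a single pass: O(n) work.
partial_sums_linear(y, f) = cumsum(f .* y)

y = [1.0, 2.0, 3.0, 4.0]
f = [0.25, 0.25, 0.25, 0.25]
println(partial_sums_quadratic(y, f))
println(partial_sums_linear(y, f))
```

Both calls compute the same S values; only the amount of work differs. For n = 10000 rows the nested loop performs about 50 million multiply-adds, against 10000 for the cumsum version.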
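IainDunning has already provided a good answer, with code that is fast enough for practical purposes (function gini2). If you enjoy performance tuning, you can get an additional speedup of about 20x by avoiding temporary arrays (gini3). See the following code, which compares the performance of the two implementations: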
using TimeIt
wagedistarray=zeros(10000,2)
for i in 1:size(wagedistarray,1)
    wagedistarray[i,1]=1
    wagedistarray[i,2]=1/10000
end
wages = wagedistarray[:,1]
wagefrequencies = wagedistarray[:,2];
# original code
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
                sum(wagedistarray[2:end,2] .*
                        (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end
# new code
function gini3(wages, wagefrequencies)
    Swages_previous = wages[1]*wagefrequencies[1]
    Gwages = Swages_previous*wagefrequencies[1]
    @inbounds for i = 2:length(wages)
        freq = wagefrequencies[i]
        Swages_current = Swages_previous + wages[i]*freq
        Gwages += freq * (Swages_current+Swages_previous)
        Swages_previous = Swages_current
    end
    return 1.0 - Gwages/Swages_previous
end
result=gini2(wagedistarray) # warming up JIT
println("result with gini2: $result, time:")
@timeit result=gini2(wagedistarray)
result=gini3(wages, wagefrequencies) # warming up JIT
println("result with gini3: $result, time:")
@timeit result=gini3(wages, wagefrequencies)
The output is:
result with gini2: 0.0, time:
1000 loops, best of 3: 321.57 µs per loop
result with gini3: -1.4210854715202004e-14, time:
10000 loops, best of 3: 16.24 µs per loop
gini3 is somewhat less accurate than gini2 because of its sequential summation; one would have to use a variant of pairwise summation to improve the accuracy.
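For readers unfamiliar with the term, here is a generic sketch of sequential versus pairwise summation. It is not a drop-in change to gini3, whose Gwages update depends on the running S values and would need restructuring; the function names sum_sequential and sum_pairwise and the base-case size of 8 are arbitrary choices made for this illustration.

```julia
# Left-to-right summation: rounding error can grow roughly linearly with n.
function sum_sequential(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end

# Pairwise summation: split the range in half, sum each half, add the results.
# Rounding error grows roughly with log(n) instead of n.
function sum_pairwise(x, lo=1, hi=length(x))
    if hi - lo < 8             # small ranges are summed sequentially
        s = zero(eltype(x))
        for i in lo:hi
            s += x[i]
        end
        return s
    end
    mid = (lo + hi) ÷ 2
    return sum_pairwise(x, lo, mid) + sum_pairwise(x, mid + 1, hi)
end

x = fill(1/10000, 10000)       # same probabilities as in the benchmark above
println("sequential error: ", abs(sum_sequential(x) - 1.0))
println("pairwise error:   ", abs(sum_pairwise(x) - 1.0))
```

Julia's built-in sum on arrays already uses a pairwise scheme, which is likely why gini2, built on sum, returns exactly 0.0 in the output above.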