Percentiles in pandas vs. Scala: where is the bug?
For the list of numbers
val numbers = Seq(0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205)
Python / pandas computes the following percentiles:
25% 0.167289
50% 0.348107
75% 0.692389
However, Scala returns:
calcPercentiles(Seq(.25, .5, .75), sortedNumber.toArray)
25% 0.1601818278168149
50% 0.3481071101229365
75% 0.7182103704579226
The numbers almost match, but they differ. How can I eliminate the difference (and most likely fix the bug in my Scala code)?
val sortedNumber = numbers.sorted
import scala.collection.mutable
case class PercentileResult(percentile:Double, value:Double)
// https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
def calculatePercentile(arr: Array[Double], p: Double) = {
  // +1 so that the .5 == mean for even number of elements.
  val f = (arr.length + 1) * p
  val i = f.toInt
  if (i == 0) arr.head
  else if (i >= arr.length) arr.last
  else {
    arr(i - 1) + (f - i) * (arr(i) - arr(i - 1))
  }
}
def calcPercentiles(percentiles: Seq[Double], arr: Array[Double]): Array[PercentileResult] = {
  val results = new mutable.ListBuffer[PercentileResult]
  percentiles.foreach(p => {
    val r = PercentileResult(percentile = p, value = calculatePercentile(arr, p))
    results.append(r)
  })
  results.toArray
}
Python:
import pandas as pd
df = pd.DataFrame({'foo':[0.0817381355303346, 0.08907955219917718, 0.10581384008994665, 0.10970915785902469, 0.1530743353025532, 0.16728932033107657, 0.181932212814931, 0.23200826752868853, 0.2339654613723784, 0.2581657775305527, 0.3481071101229365, 0.5010850992326521, 0.6153244818101578, 0.6233250409474894, 0.6797744231690304, 0.6923891392381571, 0.7440316016776881, 0.7593186414698002, 0.8028091068764153, 0.8780699052482807, 0.8966649331194205]})
display(df.head())
df.describe()
After some trial and error I wrote this code, which returns the same results as pandas (using linear interpolation, since that is the pandas default):
def calculatePercentile(numbers: Seq[Double], p: Double): Double = {
  // interpolate only - no special handling of the case when rank is integer
  val rank = (numbers.size - 1) * p
  val i = numbers(math.floor(rank).toInt)
  val j = numbers(math.ceil(rank).toInt)
  val fraction = rank - math.floor(rank)
  i + (j - i) * fraction
}
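For readers who want to cross-check the function above without a Scala REPL, here is a minimal Python sketch of the same rank formula, (n - 1) * p with linear interpolation. The name percentile_linear is made up for this illustration. It assumes the input is already sorted, non-empty, and that 0 <= p <= 1.

```python
import math

numbers = [
    0.0817381355303346, 0.08907955219917718, 0.10581384008994665,
    0.10970915785902469, 0.1530743353025532, 0.16728932033107657,
    0.181932212814931, 0.23200826752868853, 0.2339654613723784,
    0.2581657775305527, 0.3481071101229365, 0.5010850992326521,
    0.6153244818101578, 0.6233250409474894, 0.6797744231690304,
    0.6923891392381571, 0.7440316016776881, 0.7593186414698002,
    0.8028091068764153, 0.8780699052482807, 0.8966649331194205,
]


def percentile_linear(sorted_values, p):
    """Linear interpolation at 0-based rank (n - 1) * p (hypothetical helper)."""
    rank = (len(sorted_values) - 1) * p
    lo = math.floor(rank)   # index of the neighbour below the rank
    hi = math.ceil(rank)    # index of the neighbour above the rank
    fraction = rank - lo    # how far the rank sits between the two
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * fraction


data = sorted(numbers)
for p in (0.25, 0.5, 0.75):
    print(p, percentile_linear(data, p))
```

With 21 values the ranks for 25%, 50% and 75% are the integers 5, 10 and 15, so no interpolation takes place and the results are simply elements of the sorted list, which is why they line up with the pandas output in the question.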
Based on that, I would say the bug is here:
(arr.length + 1) * p
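The 0% percentile should map to index 0 and the 100% percentile to the maximum index. So for numbers (.size == 21) these are indices 0 and 20. For 100%, however, this formula yields a rank of 22, which is larger than the size of the array! If it weren't for guard clauses such as: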
else if (i >= arr.length) arr.last
you would get an error, and you would probably suspect that something is wrong. Maybe the authors of this code:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/DescriptiveStats.scala#L537
used a different definition of percentile...(?) Or they may simply have a bug. I can't tell.
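The "different definition" guess can be checked with Python's standard library alone. statistics.quantiles (Python 3.8+) supports two methods: "exclusive", which interpolates at the 1-based position (n + 1) * p, and "inclusive", which interpolates at the 0-based position (n - 1) * p. The sketch below also includes a direct Python port of the (n + 1) * p function from the question; percentile_n_plus_1 is a name chosen for this illustration. Treat it as a cross-check on this data set, not as a statement about what the Breeze authors intended.

```python
import statistics

numbers = [
    0.0817381355303346, 0.08907955219917718, 0.10581384008994665,
    0.10970915785902469, 0.1530743353025532, 0.16728932033107657,
    0.181932212814931, 0.23200826752868853, 0.2339654613723784,
    0.2581657775305527, 0.3481071101229365, 0.5010850992326521,
    0.6153244818101578, 0.6233250409474894, 0.6797744231690304,
    0.6923891392381571, 0.7440316016776881, 0.7593186414698002,
    0.8028091068764153, 0.8780699052482807, 0.8966649331194205,
]
data = sorted(numbers)


def percentile_n_plus_1(sorted_values, p):
    """Port of the question's formula: 1-based rank f = (n + 1) * p."""
    f = (len(sorted_values) + 1) * p
    i = int(f)  # truncation, like .toInt in the Scala code
    if i == 0:
        return sorted_values[0]
    if i >= len(sorted_values):
        return sorted_values[-1]
    return sorted_values[i - 1] + (f - i) * (sorted_values[i] - sorted_values[i - 1])


ported = [percentile_n_plus_1(data, p) for p in (0.25, 0.5, 0.75)]
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("ported   ", ported)
print("exclusive", exclusive)
print("inclusive", inclusive)
```

On this data the ported function and the "exclusive" method agree with the Scala output in the question (about 0.16018, 0.34811, 0.71821), while "inclusive" agrees with pandas (0.167289, 0.348107, 0.692389). That is consistent with the two results coming from two interpolation conventions rather than from an arithmetic slip.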
By the way, this:
def calcPercentiles(percentiles:Seq[Double], arr: Array[Double]): Array[PercentileResult]
would be much simpler to write like this:
def calcPercentiles(percentiles: Seq[Double], numbers: Seq[Double]): Seq[PercentileResult] =
  percentiles.map { p =>
    PercentileResult(p, calculatePercentile(numbers, p))
  }