Definitions of quantiles in R

Question

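Main question: Suppose you have a discrete, finite data set $d$. The command summary(d) then returns the minimum, first quartile, median, mean, third quartile, and maximum. My question is: what formula does R use to compute the first quartile?

Background: My data set is d=c(1,2,3,3,4,9). summary(d) returns 2.25 as the first quartile. Now, one way to compute the first quartile is to choose a value q1 such that 25% of the data set is less than or equal to q1. Clearly, this is not what R uses. So I am wondering: what formula does R use to compute the first quartile?

Google searches on this topic left me even more confused, and I could not find the formula R uses. Typing help(summary) in R did not help me either.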

Answer 1

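General discussion:

There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.

As a result, the wide variety of packages between them use many different definitions.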

The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions of the quantile function, and mentions which of a number of common packages use which definitions. Its introduction says:

the sample quantiles that are used in statistical packages are all based on one or two order statistics, and can be written as

$$\hat{Q}_i(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j+1)}\,, \quad \text{where } \frac{j-m}{n}\leq p< \frac{j-m+1}{n} \qquad (1)$$

for some $m\in \mathbb{R}$ and $0\leq\gamma\leq 1$.

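That is to say, in general the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that only one of them carries any weight).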
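To see that the competing definitions really do give different answers, here is a small sketch that evaluates the first quartile of the question's data under each `type` value (1 to 9) accepted by R's `quantile()`. The variable name `q1_by_type` is just an illustrative choice.

```r
# First quartile of the question's data under each of the nine
# definitions that quantile() exposes through its `type` argument.
d <- c(1, 2, 3, 3, 4, 9)

q1_by_type <- sapply(1:9, function(t) {
  unname(quantile(d, probs = 0.25, type = t))
})
names(q1_by_type) <- paste0("type", 1:9)

q1_by_type                 # one value per definition
q1_by_type[["type7"]]      # the default definition, which summary(d) reports
```

Only the `type7` entry should match the 2.25 from the question; the other entries show how much the choice of definition matters for a sample this small.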

In R:

In particular, R offers all nine definitions mentioned in Hyndman & Fan (with $7$ as the default). From Hyndman & Fan we see:

Definition 7. Gumbel (1939) also considered the modal position $p_k = \text{mode}\,F(X_{(k)}) = (k-1)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p)$ and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.

What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd of the 9 observations.

We can see that in R:

> quantile(1:9)
  0%  25%  50%  75% 100% 
   1    3    5    7    9

For the behaviour when n is not of the form 4k+1, the easiest thing to do is to try it:

> quantile(1:10)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
> quantile(1:11)
  0%  25%  50%  75% 100% 
 1.0  3.5  6.0  8.5 11.0 
> quantile(1:12)
   0%   25%   50%   75%  100% 
 1.00  3.75  6.50  9.25 12.00

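When k isn't an integer, R takes a weighted average of the two adjacent order statistics, with weights given by the fractional part of k (that is, it does linear interpolation).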
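As a quick check of that rule, the sketch below compares the predicted position k = 1 + (n-1)/4 with what `quantile()` reports for the same sample sizes tried above. Because the data are 1:n, the interpolated value equals the fractional position itself. The names `predicted_k` and `observed_q1` are illustrative only.

```r
# For x = 1:n the order statistic at (fractional) position k is just k,
# so the type 7 first quartile should equal 1 + (n - 1) * 0.25.
n_values <- 9:12

predicted_k <- 1 + (n_values - 1) * 0.25
observed_q1 <- sapply(n_values, function(n) {
  unname(quantile(seq_len(n), probs = 0.25))
})

data.frame(n = n_values, predicted_k = predicted_k, observed_q1 = observed_q1)
```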

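The nice thing is that, on average, you get 3 times as many observations above the first quartile as below it. So for 9 observations, for example, you get 6 above and 2 below the third observation, which divides them in the ratio 3:1.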

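What's happening with your sample data

You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, the quartile lies 25% of the way from the second observation to the third (which coincidentally are themselves 2 and 3), so the lower quartile is 2+0.25*(3-2) = 2.25.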
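The same arithmetic can be written out step by step in R. This is only a sketch of the calculation described above, not R's internal code, and the intermediate names (`k`, `lo`, `frac`, `q1_manual`) are made up for illustration.

```r
# Reproduce the first quartile of d by hand using the type 7 rule.
d <- c(1, 2, 3, 3, 4, 9)
x <- sort(d)                 # order statistics
n <- length(x)

k    <- 1 + (n - 1) * 0.25   # fractional position: 2.25
lo   <- floor(k)             # lower neighbour: 2nd order statistic
frac <- k - lo               # how far towards the 3rd: 0.25

q1_manual <- x[lo] + frac * (x[lo + 1] - x[lo])

q1_manual                    # hand calculation
quantile(d, probs = 0.25)    # R's default quantile for comparison
```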

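Under the hood: some R details:

When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (that is, the relevant summary method for the class of the object you called it on). Its existence is mentioned in the help on summary.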

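The summary.data.frame function (ultimately, via summary.default applied to each column) calls quantile to compute the quartiles. Unfortunately you won't see this in the help, since ?summary.data.frame simply takes you to the summary help, which gives no details on what happens when summary is applied to a numeric vector; this is one of those really bad spots in the help.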
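One way to convince yourself of this without reading the source is to compare the two results directly. The sketch below assumes the element names that `summary()` prints for a numeric vector (`"1st Qu."`, `"Median"`, `"3rd Qu."`) and the default percentage names from `quantile()`.

```r
# Compare the quartiles reported by summary() with quantile()'s defaults.
d <- c(1, 2, 3, 3, 4, 9)

s <- summary(d)
q <- quantile(d, probs = c(0.25, 0.5, 0.75))

from_summary  <- c(s[["1st Qu."]], s[["Median"]], s[["3rd Qu."]])
from_quantile <- c(q[["25%"]], q[["50%"]], q[["75%"]])

all.equal(from_summary, from_quantile)
```

Keep in mind that summary() may round what it displays, so for other data sets the printed values can look slightly different from quantile()'s even when the underlying numbers agree.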

So ?quantile (or help(quantile)) describes what R does.

Here are two things it says (based directly on Hyndman & Fan). First, it gives general information:

All sample quantiles are defined as weighted averages of consecutive order statistics. Sample quantiles of type i are defined by:

Q[i](p) = (1 - γ) x[j] + γ x[j+1],

where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, the value of γ is a function of j = floor(np + m) and g = np + m - j, and m is a constant determined by the sample quantile type.

Second, there is specific information about method 7:

Type 7
m = 1-p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
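The general formula from the help can be turned into a few lines of R. The sketch below is a hypothetical helper, `hf_quantile()`, not R's actual implementation: it covers only the continuous types 4 to 9, where γ equals g, takes the constants m from the ?quantile help page, handles a single probability at a time, and clamps the indices at the ends of the sample. It also skips the floating-point fuzz a production implementation would need.

```r
# Hypothetical helper: Q(p) = (1 - g) * x[j] + g * x[j + 1]
# with j = floor(n * p + m) and g = n * p + m - j (types 4 to 9 only).
hf_quantile <- function(x, p, type = 7) {
  stopifnot(length(p) == 1, p >= 0, p <= 1, type %in% 4:9)
  x <- sort(x)
  n <- length(x)
  # Constant m for each type, as listed in ?quantile
  m <- switch(as.character(type),
              "4" = 0,
              "5" = 0.5,
              "6" = p,
              "7" = 1 - p,
              "8" = (p + 1) / 3,
              "9" = p / 4 + 3 / 8)
  h <- n * p + m
  j <- floor(h)
  g <- h - j
  # Clamp positions to 1..n so p near 0 or 1 stays inside the sample
  lower <- x[min(max(j, 1), n)]
  upper <- x[min(max(j + 1, 1), n)]
  (1 - g) * lower + g * upper
}

d <- c(1, 2, 3, 3, 4, 9)
hf_quantile(d, 0.25)              # default type 7
quantile(d, probs = 0.25)         # built-in, for comparison
```

With m = 1-p, the quantity n*p + m simplifies to 1 + (n-1)p, which is exactly the position k used in the worked example earlier.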

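Hopefully the explanation I gave earlier helps to make more sense of what this is saying. As far as definitions go, the help on quantile pretty much just quotes Hyndman & Fan, and its behaviour is pretty simple.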

Reference:

[1]: Rob J. Hyndman and Yanan Fan (1996),
"Sample Quantiles in Statistical Packages,"
The American Statistician, Vol. 50, No. 4 (Nov.), pp. 361-365

See also the discussion here.
