Using weights in R to account for the inverse of sampling probability

Question

This is similar to, but not the same as, a related question.

I have a long data frame; this is an excerpt of the real data:

age gender labour_situation industry_code FACT FACT_2....
35  M      unemployed       15            1510
21  F      inactive         00            651

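FACT is a weight variable: for the first row, it means that this 35-year-old unemployed male represents 1510 individuals in the population.

I need to produce tables showing relevant information, such as the percentage of employed and unemployed people. In Stata there are options for this: tab labour_situation [w=FACT] shows the number of employed and unemployed people in the population, whereas tab labour_situation shows the number of employed and unemployed people in the sample.

A partial solution might be to repeat the first row of the data frame 1510 times and the second row 651 times. From what I have found, one option is to run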

longdata <- data[rep(1:nrow(data), data$FACT), ]
employment_table = with(longdata, addmargins(table(labour_situation, useNA = "ifany")))
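If the goal is only the weighted counts and percentages, expanding the data frame may not be necessary. A minimal sketch in base R, assuming FACT holds the weights: xtabs() sums the left-hand side of its formula within each category. The two-row data frame below is a toy stand-in built from the rows shown above, not the real data.

```r
# Toy data mirroring the two rows shown in the question (illustrative only)
data <- data.frame(
  labour_situation = c("unemployed", "inactive"),
  FACT = c(1510, 651)
)

# Sum the weights within each category: population counts, not sample counts
weighted_counts <- xtabs(FACT ~ labour_situation, data = data)

# Population shares in percent
weighted_pct <- 100 * prop.table(weighted_counts)

# Add a total, similar in spirit to Stata's weighted tab output
print(addmargins(weighted_counts))
print(round(weighted_pct, 1))
```

This avoids creating thousands of duplicated rows and also works when the weights are not whole numbers.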

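The other thing I need to do is run a regression that accounts for cluster sampling carried out as follows: the population is divided into regions. This creates a problem: a respondent in $region_1$ represents $p_1$ people, while a respondent in $region_2$ represents $p_2$ people, but $p_1$ and $p_2$ are not proportional to the total population of each region, so some regions are over-represented and others are under-represented. To account for this, each observation should be weighted by the inverse of its probability of being sampled.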

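The last paragraph implies that the model $y_i=\beta_i x_i+u_i$ can be estimated with the usual formula $\beta=(X'X)^{-1}(X'y)$, but the variance-covariance matrix will not be $\Sigma=\frac{1}{m-k-1}(u'u)(X'X)^{-1}$; instead it will be $\Sigma=\frac{m}{m-k-1}(X'X)^{-1}(X'WX)(X'X)^{-1}$ once I take the inverse of the sampling probability into account.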

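In Stata, the regression can be run with reg y x1 x2 [pweight=n], which uses the inverse of the sampling probability to compute the correct variance-covariance matrix. Until now I have had to use Stata for part of my work and R for the rest. I would like to use R only.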
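One possible way to approach this in base R is weighted least squares with a sandwich (robust) variance estimate. The sketch below is an illustration under stated assumptions, not a verified replica of Stata's output: the data are simulated, the names d, w, x1, x2 and y are hypothetical, and the variance uses the textbook form $\frac{n}{n-k}(X'WX)^{-1}\left(\sum_i w_i^2 e_i^2 x_i x_i'\right)(X'WX)^{-1}$, which is not identical to the formula written above. For real survey data, the survey package (svydesign() followed by svyglm()) is probably the more usual route, since it can also handle strata and clusters.

```r
# Simulated data; every name here is hypothetical
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), w = runif(n, 1, 10))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Point estimates: weighted least squares with the sampling weights
fit <- lm(y ~ x1 + x2, data = d, weights = w)

X <- model.matrix(fit)            # design matrix including the intercept
e <- d$y - drop(X %*% coef(fit))  # raw (unweighted) residuals
k <- ncol(X)                      # number of estimated coefficients

# Sandwich estimator: bread %*% meat %*% bread
bread <- solve(crossprod(X, d$w * X))  # (X'WX)^{-1}
meat  <- crossprod(d$w * e * X)        # sum of w_i^2 * e_i^2 * x_i x_i'
V     <- n / (n - k) * bread %*% meat %*% bread

robust_se <- sqrt(diag(V))
print(cbind(estimate = coef(fit), robust_se = robust_se))
```

The standard errors printed by summary(fit) treat w as precision weights, so they generally differ from robust_se even though the coefficients are the same.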

Answer 1

You can do this by repeating the row names:

df1 <- df[rep(row.names(df), df$FACT), 1:5]

> head(df1)
    age gender labour_situation industry_code FACT
1    35      M       unemployed            15 1510
1.1  35      M       unemployed            15 1510
1.2  35      M       unemployed            15 1510
1.3  35      M       unemployed            15 1510
1.4  35      M       unemployed            15 1510
1.5  35      M       unemployed            15 1510
> tail(df1)
      age gender labour_situation industry_code FACT
2.781  21      F         inactive             0  787
2.782  21      F         inactive             0  787
2.783  21      F         inactive             0  787
2.784  21      F         inactive             0  787
2.785  21      F         inactive             0  787
2.786  21      F         inactive             0  787

Here 1:5 refers to the columns to keep. If you leave that position empty, all columns are returned.
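A small self-contained check of the column-selection behaviour, using a made-up three-column data frame with tiny weights so the result is easy to inspect (the values are illustrative, not taken from the real data). Note that rep() works with whole-number counts, so this approach only suits integer weights.

```r
# Hypothetical toy data with small integer weights
df <- data.frame(
  age    = c(35, 21),
  gender = c("M", "F"),
  FACT   = c(3, 2)
)

# Empty column slot: every column is kept
all_cols <- df[rep(row.names(df), df$FACT), ]

# Column slot 1:2: only the first two columns are kept
some_cols <- df[rep(row.names(df), df$FACT), 1:2]

# The expanded data frame has one row per represented individual
print(nrow(all_cols) == sum(df$FACT))
print(names(some_cols))
```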


regression

r

dataframe

stata