R equivalent of Stata `tabulate , generate( )` command
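I would like to mimic the behavior of Stata's tabulate , generate() command in R. As illustrated below, the command does two things. First, in my example, it produces a one-way table of frequency counts. Second, it generates a dummy variable for each value contained in the variable (var1), naming the generated dummies (d_1 - d_7) with the prefix (stubname) declared in the option ,generate(). My question is about the second function. Base R solutions are preferred, but package-dependent ones are welcome too.
[Edit]: My final goal is to generate a data.frame() that mimics the last dataset printed on screen.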
clear all
input var1
0
1
2
2
2
2
42
42
777
888
999999
end
tabulate var1 ,gen(d_)
/*     var1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          1        9.09        9.09
          1 |          1        9.09       18.18
          2 |          4       36.36       54.55
         42 |          2       18.18       72.73
        777 |          1        9.09       81.82
        888 |          1        9.09       90.91
     999999 |          1        9.09      100.00
------------+-----------------------------------
      Total |         11      100.00 */
list, sep(11)
/*   +--------------------------------------------------+
     |   var1   d_1   d_2   d_3   d_4   d_5   d_6   d_7 |
     |--------------------------------------------------|
  1. |      0     1     0     0     0     0     0     0 |
  2. |      1     0     1     0     0     0     0     0 |
  3. |      2     0     0     1     0     0     0     0 |
  4. |      2     0     0     1     0     0     0     0 |
  5. |      2     0     0     1     0     0     0     0 |
  6. |      2     0     0     1     0     0     0     0 |
  7. |     42     0     0     0     1     0     0     0 |
  8. |     42     0     0     0     1     0     0     0 |
  9. |    777     0     0     0     0     1     0     0 |
 10. |    888     0     0     0     0     0     1     0 |
 11. | 999999     0     0     0     0     0     0     1 |
     +--------------------------------------------------+ */
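I guess you are assuming that each value in var1 is unique, so that you get dummy variables rather than counts in the d_ fields.
You could try something like this: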
var1 <- 1:5
dummy_matrix <- vapply(var1, function(x) as.numeric(var1 == x), numeric(length(var1))) # create a matrix of dummy vars, one column per value
colnames(dummy_matrix) <- paste0("d_", var1) # name the columns
data.frame(var1, dummy_matrix) # bind to var1 as a data.frame
Output:
  var1 d_1 d_2 d_3 d_4 d_5
1    1   1   0   0   0   0
2    2   0   1   0   0   0
3    3   0   0   1   0   0
4    4   0   0   0   1   0
5    5   0   0   0   0   1
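The snippet above relies on every value of var1 occurring once. For the data in the question, where values repeat, a minimal base R sketch is to loop over the sorted distinct values instead and number the columns by position, which appears to be how Stata names d_1 to d_7. The names vals and dummies are illustrative, not from the original answer.

```r
# Data from the question (values repeat, unlike 1:5 above)
var1 <- c(0, 1, 2, 2, 2, 2, 42, 42, 777, 888, 999999)

# One dummy column per distinct value, in ascending order
vals <- sort(unique(var1))
dummies <- vapply(vals, function(x) as.numeric(var1 == x), numeric(length(var1)))

# Number the columns by position (d_1 ... d_7) rather than by value
colnames(dummies) <- paste0("d_", seq_along(vals))

# Final data.frame: var1 followed by the seven dummies
df <- data.frame(var1, dummies)
df
```

Each row then has exactly one dummy equal to 1, and the four rows with var1 == 2 all fall under d_3.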
set.seed(123)
df = data.frame(var1 = factor(sample(10, 20, TRUE)))
df = data.frame(df, model.matrix(~0+var1, df)) # 0 suppresses the intercept, so every level of var1 gets its own dummy column and no base group is dropped.
names(df)[-1] = paste0('d_', 1:(ncol(df)-1))
df
   var1 d_1 d_2 d_3 d_4 d_5 d_6 d_7 d_8 d_9
1     3   0   1   0   0   0   0   0   0   0
2     3   0   1   0   0   0   0   0   0   0
3    10   0   0   0   0   0   0   0   0   1
4     2   1   0   0   0   0   0   0   0   0
5     6   0   0   0   0   1   0   0   0   0
6     5   0   0   0   1   0   0   0   0   0
7     4   0   0   1   0   0   0   0   0   0
8     6   0   0   0   0   1   0   0   0   0
9     9   0   0   0   0   0   0   0   1   0
10   10   0   0   0   0   0   0   0   0   1
11    5   0   0   0   1   0   0   0   0   0
12    3   0   1   0   0   0   0   0   0   0
13    9   0   0   0   0   0   0   0   1   0
14    9   0   0   0   0   0   0   0   1   0
15    9   0   0   0   0   0   0   0   1   0
16    3   0   1   0   0   0   0   0   0   0
17    8   0   0   0   0   0   0   1   0   0
18   10   0   0   0   0   0   0   0   0   1
19    7   0   0   0   0   0   1   0   0   0
20   10   0   0   0   0   0   0   0   0   1
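Applied to the data from the question, the same model.matrix idea can be combined with table() to cover both parts of tabulate , generate(). This is only a sketch: the object names freq, f and dummies are made up for illustration, and the rounding to two decimals is an assumption chosen to match the Stata display.

```r
# Data from the question
var1 <- c(0, 1, 2, 2, 2, 2, 42, 42, 777, 888, 999999)

# Part 1: one-way frequency table (Freq., Percent, Cum.)
counts <- table(var1)
pct <- 100 * as.vector(prop.table(counts))
freq <- data.frame(
  var1    = names(counts),
  Freq    = as.vector(counts),
  Percent = round(pct, 2),
  Cum     = round(cumsum(pct), 2)
)
freq

# Part 2: one dummy per level; ~ 0 + f drops the intercept so no level is omitted
f <- factor(var1)
dummies <- model.matrix(~ 0 + f)
colnames(dummies) <- paste0("d_", seq_len(ncol(dummies)))
df <- data.frame(var1, dummies)
df
```

Because factor() orders the levels of a numeric vector in ascending order, d_1 to d_7 line up with 0, 1, 2, 42, 777, 888 and 999999, the same order as in the Stata listing.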