How do I create interaction columns for all columns in tidyverse?
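I am trying to create interaction variables for all 20 variables in a data frame, so that in total there are 20 base variables and 380 interaction variables. For any single variable, I can create a data frame of 19 variables with the following: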
in_sample[3:22] %>%
transmute(across(.cols = -c(frpm_frac_s), .fns = function(x){x*frpm_frac_s}))
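But I can't iterate over the columns. I tried using map over a vector of column names, but I couldn't get the function inside map to read as.symbol(character).
Here is a sample of my data from dput: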
structure(list(frpm_frac_s = c(0.870400011539459, 0.904699981212616,
0.98089998960495, 0.838800013065338, 0.919900000095367, 0.837700009346008,
0.84799998998642, 0.925999999046326, 0.963900029659271, 0.887899994850159
), enrollment_s = c(364, 608, 571, 705, 566, 838, 421, 757, 693,
535), ell_frac_s = c(0.46000000834465, 0.334000021219254, 0.300999999046326,
0.209999993443489, 0.706999957561493, 0.552999973297119, 0.412999987602234,
0.359000027179718, 0.726000010967255, 0.646999955177307), edi_s = c(8,
38, 39, 37, 11, 35, 15, 39, 9, 4), te_fte_s = c(23, 22, 20, 25,
24.5, 36, 18, 30.2999992370605, 24.3999996185303, 19)), row.names = c(NA,
10L), class = "data.frame")
When I use:
in_sample[3:22] %>%
transmute(across(.cols = -c(frpm_frac_s), .fns = function(x){x*frpm_frac_s}))
I get:
structure(list(enrollment_s = c(316.825604200363, 550.057588577271,
560.093894064426, 591.354009211063, 520.663400053978, 701.992607831955,
357.007995784283, 700.981999278069, 667.982720553875, 475.026497244835
), ell_frac_s = c(0.400384012571335, 0.302169812922072, 0.295250895935631,
0.17614799724412, 0.650369261028242, 0.463248082799339, 0.350223985351086,
0.33243402482605, 0.699791432103968, 0.574471256869984), edi_s = c(6.96320009231567,
34.3785992860794, 38.255099594593, 31.0356004834175, 10.118900001049,
29.3195003271103, 12.7199998497963, 36.1139999628067, 8.67510026693344,
3.55159997940063), te_fte_s = c(20.0192002654076, 19.9033995866776,
19.617999792099, 20.9700003266335, 22.5375500023365, 30.1572003364563,
15.2639998197556, 28.0577992646217, 23.5191603559875, 16.870099902153
)), row.names = c(NA, 10L), class = "data.frame")
I would like to do this for all of the variables and then bind the results together.
Thanks for your help.
You can use model.matrix to create the interaction terms. (This is what most modeling functions do behind the scenes.)
m = model.matrix(~ .^2 - . + 0, data = df)
m
# frpm_frac_s:enrollment_s frpm_frac_s:ell_frac_s frpm_frac_s:edi_s frpm_frac_s:te_fte_s
# 1 316.8256 0.4003840 6.9632 20.01920
# 2 550.0576 0.3021698 34.3786 19.90340
# 3 560.0939 0.2952509 38.2551 19.61800
# 4 591.3540 0.1761480 31.0356 20.97000
# 5 520.6634 0.6503693 10.1189 22.53755
# 6 701.9926 0.4632481 29.3195 30.15720
# 7 357.0080 0.3502240 12.7200 15.26400
# 8 700.9820 0.3324340 36.1140 28.05780
# 9 667.9827 0.6997914 8.6751 23.51916
# 10 475.0265 0.5744713 3.5516 16.87010
# enrollment_s:ell_frac_s enrollment_s:edi_s enrollment_s:te_fte_s ell_frac_s:edi_s
# 1 167.440 2912 8372.0 3.680
# 2 203.072 23104 13376.0 12.692
# 3 171.871 22269 11420.0 11.739
# 4 148.050 26085 17625.0 7.770
# 5 400.162 6226 13867.0 7.777
# 6 463.414 29330 30168.0 19.355
# 7 173.873 6315 7578.0 6.195
# 8 271.763 29523 22937.1 14.001
# 9 503.118 6237 16909.2 6.534
# 10 346.145 2140 10165.0 2.588
# ell_frac_s:te_fte_s edi_s:te_fte_s
# 1 10.5800 184.0
# 2 7.3480 836.0
# 3 6.0200 780.0
# 4 5.2500 925.0
# 5 17.3215 269.5
# 6 19.9080 1260.0
# 7 7.4340 270.0
# 8 10.8777 1181.7
# 9 17.7144 219.6
# 10 12.2930 76.0
# attr(,"assign")
# [1] 1 2 3 4 5 6 7 8 9 10
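Your math is a little off: because order doesn't matter in multiplication, there are n * (n - 1) / 2 possibilities (the same as n choose 2), so for 20 input columns you should expect 190 output columns, not 380.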
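As a quick check of that count, here is a minimal sketch in base R; the variable names n, n_pairs and n_ordered are illustrative only:

```r
# Number of base columns in the question
n <- 20

# Unordered pairs: a*b and b*a are the same interaction
n_pairs <- n * (n - 1) / 2
n_pairs          # 190
choose(n, 2)     # 190, the same count via "n choose 2"

# Ordered pairs: this is where the 380 in the question comes from
n_ordered <- n * (n - 1)
n_ordered        # 380
```

The same formula gives choose(5, 2) = 10 for the five-column sample, which matches the ten interaction columns printed above.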
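I set up the formula to include only the interaction terms. You can also use ~ .^2 + 0 to include the first-order terms as well, or ~ .^2 to include the intercept too.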
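To compare the three formulas side by side, here is a rough sketch on a cut-down version of the sample data. It assumes only the first three columns and first three rows from the question, with values rounded for brevity; the object names m_int, m_main, m_full and combined are made up for illustration.

```r
# Cut-down sample: first three columns and rows of the question's data,
# rounded for brevity (an assumption made to keep the example short)
df <- data.frame(
  frpm_frac_s  = c(0.8704, 0.9047, 0.9809),
  enrollment_s = c(364, 608, 571),
  ell_frac_s   = c(0.460, 0.334, 0.301)
)

# Interaction terms only: drop the first-order terms and the intercept
m_int <- model.matrix(~ .^2 - . + 0, data = df)

# First-order terms plus interactions, no intercept
m_main <- model.matrix(~ .^2 + 0, data = df)

# Intercept, first-order terms, and interactions
m_full <- model.matrix(~ .^2, data = df)

ncol(m_int)   # 3 = choose(3, 2)
ncol(m_main)  # 6 = 3 base columns + 3 interactions
ncol(m_full)  # 7 = the above plus "(Intercept)"

# model.matrix returns a matrix; convert it to a data frame and
# bind it to the original columns, as the question asks
combined <- cbind(df, as.data.frame(m_int))
ncol(combined)  # 6
```

The interaction columns are named with a colon, such as frpm_frac_s:enrollment_s, so they need backticks if you refer to them by name in later dplyr code.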