如何进行矩阵计算以获得变量的叉积

How to do a matrix calculation to get the cross products of variables

我在 R 中有一些数据,其中有很多列。请以下面为例

x = replicate(5, rnorm(10)) 
colnames(x) = c('a','b','c','d','e')

我想计算每个组合的叉积和比率,并将它们附加到 table 的末尾。我还想给它们命名,以便它们与它们的计算结果相关

结果应该有像这样的额外列:

cp_a_b,
cp_a_c,
cp_a_d,
cp_a_e,
cp_b_c,
cp_b_d,
cp_b_e,
cp_c_d,
cp_c_e,
cp_d_e,
ratio_a_b,
ratio_a_c,
ratio_a_d,
ratio_a_e,
ratio_b_c,
ratio_b_d,
ratio_b_e,
ratio_c_d,
ratio_c_e,
ratio_d_e,

其中 cp 是叉积,ratio 是两列的比率 我想把它作为矩阵计算来做,所以它很快而不是循环

我在 R 方面还是个新手,但无论如何这里还是要尝试一下。为了娱乐!我不知道它是否有任何希望很快。大概是太天真了吧...

首先是一个示例矩阵 x num_observations x num_features 的小随机整数。

num_features <- 5
num_observations <- 20
features <- letters[1:num_features]

x <- replicate(num_features, sample(1:10, num_observations, replace = T))

colnames(x) <- features

特征对的所有组合:

combinations <- combn(features, 2)
num_combinations = ncol(combinations)

对于每个特征对,我们将乘以 x 中的相应列。为新矩阵保留 space,其中相乘的列将结束:

y <- matrix(NA, ncol = num_combinations, nrow = num_observations)
cn <- rep("?", num_combinations) # column names of new features

乘以列组合:

for (i in 1:num_combinations)
{
  cn[i] <- paste(combinations[1,i], combinations[2,i], sep = ".")
  y[,i] <- x[,combinations[1,i]] * x[,combinations[2,i]]
}
colnames(y) <- cn

最终合并原始矩阵和附加特征:

x <- cbind(x, y)

为了简单起见,这只处理乘法,使用除法创建的其他特征当然是相似的。

更新

@nongkrong 在评论中建议的一个很好的方法放弃了显式循环,而只是:

y <- combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] * cols[[2]])
x <- cbind(x, y)

它没有明确设置新功能的列名,但更优雅,更易读。在一些快速的时间里,我做的也快了大约 30%!

基于 WhiteViking 和 bunk 的答案,下面是添加列名称的代码:

set.seed(1)
x = replicate(5, rnorm(10))
colnames(x) = c('a','b','c','d','e')

mult <- combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] * cols[[2]])
colnames(mult) <-paste("cp",combn(colnames(x), 2L, paste, collapse = "_"),sep="_")
ratio <- combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] / cols[[2]])
colnames(ratio) <-paste("ratio",combn(colnames(x), 2L, paste, collapse = "_"),sep="_")
cbind(x,mult,ratio)

> cbind(x,mult,ratio)
            a        b        c        d       e    cp_a_b   cp_a_c   cp_a_d
 [1,] -0.6265  1.51178  0.91898  1.35868 -0.1645 -0.947061 -0.57570 -0.85115
 [2,]  0.1836  0.38984  0.78214 -0.10279 -0.2534  0.071592  0.14363 -0.01888
 [3,] -0.8356 -0.62124  0.07456  0.38767  0.6970  0.519126 -0.06231 -0.32395
 [4,]  1.5953 -2.21470 -1.98935 -0.05381  0.5567 -3.533068 -3.17357 -0.08583
 [5,]  0.3295  1.12493  0.61983 -1.37706 -0.6888  0.370673  0.20424 -0.45375
 [6,] -0.8205 -0.04493 -0.05613 -0.41499 -0.7075  0.036867  0.04605  0.34049
 [7,]  0.4874 -0.01619 -0.15580 -0.39429  0.3646 -0.007892 -0.07594 -0.19219
 [8,]  0.7383  0.94384 -1.47075 -0.05931  0.7685  0.696858 -1.08589 -0.04379
 [9,]  0.5758  0.82122 -0.47815  1.10003 -0.1123  0.472844 -0.27531  0.63337
[10,] -0.3054  0.59390  0.41794  0.76318  0.8811 -0.181371 -0.12763 -0.23307
        cp_a_e    cp_b_c    cp_b_d    cp_b_e   cp_c_d   cp_c_e   cp_d_e
 [1,]  0.10307  1.389293  2.054026 -0.248724  1.24860 -0.15119 -0.22353
 [2,] -0.04653  0.304911 -0.040071 -0.098771 -0.08039 -0.19816  0.02604
 [3,] -0.58240 -0.046323 -0.240837 -0.432982  0.02891  0.05197  0.27019
 [4,]  0.88803  4.405817  0.119162 -1.232842  0.10704 -1.10740 -0.02995
 [5,] -0.22695  0.697261 -1.549097 -0.774803 -0.85354 -0.42691  0.94846
 [6,]  0.58048  0.002522  0.018647  0.031790  0.02329  0.03971  0.29361
 [7,]  0.17771  0.002522  0.006384 -0.005903  0.06143 -0.05680 -0.14375
 [8,]  0.56743 -1.388149 -0.055982  0.725369  0.08724 -1.13032 -0.04558
 [9,] -0.06469 -0.392667  0.903364 -0.092261 -0.52598  0.05372 -0.12358
[10,] -0.26908  0.248216  0.453251  0.523291  0.31896  0.36825  0.67244
      ratio_a_b ratio_a_c ratio_a_d ratio_a_e ratio_b_c ratio_b_d ratio_b_e
 [1,]   -0.4144   -0.6817   -0.4611    3.8077    1.6451   1.11268  -9.18884
 [2,]    0.4711    0.2348   -1.7866   -0.7248    0.4984  -3.79270  -1.53868
 [3,]    1.3451  -11.2067   -2.1555   -1.1990   -8.3315  -1.60249  -0.89135
 [4,]   -0.7203   -0.8019  -29.6493    2.8658    1.1133  41.16157  -3.97853
 [5,]    0.2929    0.5316   -0.2393   -0.4784    1.8149  -0.81691  -1.63328
 [6,]   18.2596   14.6176    1.9771    1.1597    0.8005   0.10828   0.06351
 [7,]  -30.1063   -3.1286   -1.2362    1.3370    0.1039   0.04106  -0.04441
 [8,]    0.7823   -0.5020  -12.4479    0.9607   -0.6417 -15.91270   1.22810
 [9,]    0.7011   -1.2042    0.5234   -5.1251   -1.7175   0.74655  -7.30974
[10,]   -0.5142   -0.7307   -0.4002   -0.3466    1.4210   0.77820   0.67404
      ratio_c_d ratio_c_e ratio_d_e
 [1,]    0.6764  -5.58569  -8.25827
 [2,]   -7.6092  -3.08703   0.40570
 [3,]    0.1923   0.10699   0.55623
 [4,]   36.9733  -3.57371  -0.09666
 [5,]   -0.4501  -0.89992   1.99934
 [6,]    0.1353   0.07933   0.58657
 [7,]    0.3951  -0.42733  -1.08149
 [8,]   24.7963  -1.91371  -0.07718
 [9,]   -0.4347   4.25604  -9.79139
[10,]    0.5476   0.47434   0.86615

这是一个dplyr/tidyr答案

library(dplyr)
library(tidyr)

wide_data = 
  x %>%
  as.data.frame %>%
  mutate(row = 1:n())

prefix = function(dataframe, prefix)
  dataframe %>%
  setNames(names(.) %>% paste(prefix, . , sep = "_")))

long_data =
  wide_data %>%
  gather(column, value, -row)

long_data %>% prefix("first") %>%
  merge(long_data %>% prefix("second")) %>%
  mutate(product = first_value * second_value,
         ratio = second_value / first_value) %>%
  select(-first_value, -second_value) %>%
  gather(measure, value, product, ratio) %>%
  unite(new_column, measure, first_column, second_column, sep = "_") %>%
  spread(new_column, value) %>%
  left_join(wide_data %>% prefix("first")) %>%
  left_join(wide_data %>% prefix("second"))