How to do a matrix calculation to get the cross products of variables
I have some data in R with a lot of columns. Take the following as an example:
x = replicate(5, rnorm(10))
colnames(x) = c('a','b','c','d','e')
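I want to calculate the cross product and the ratio for every pair of columns and append them to the end of the table. I would also like to name the new columns so that they reflect the calculation they came from.
The result should have extra columns like this: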
cp_a_b,
cp_a_c,
cp_a_d,
cp_a_e,
cp_b_c,
cp_b_d,
cp_b_e,
cp_c_d,
cp_c_e,
cp_d_e,
ratio_a_b,
ratio_a_c,
ratio_a_d,
ratio_a_e,
ratio_b_c,
ratio_b_d,
ratio_b_e,
ratio_c_d,
ratio_c_e,
ratio_d_e,
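where cp is the cross product and ratio is the ratio of the two columns.
I would like to do this as a matrix calculation rather than a loop, so that it is fast.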
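I'm still a novice in R, but here is an attempt anyway. Just for fun! I don't know whether it has any hope of being fast. It is probably too naive...
First, an example matrix x of num_observations x num_features small random integers: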
num_features <- 5
num_observations <- 20
features <- letters[1:num_features]
x <- replicate(num_features, sample(1:10, num_observations, replace = TRUE))
colnames(x) <- features
All combinations of feature pairs:
combinations <- combn(features, 2)
num_combinations = ncol(combinations)
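To make the shape of that result concrete, here is a small illustration of what `combn` returns for five feature names: a 2-row character matrix with one column per pair. The names `feats` and `pairs` are mine, chosen only for this sketch.

```r
feats <- letters[1:5]

# One column per unordered pair; row 1 holds the first feature, row 2 the second
pairs <- combn(feats, 2)

dim(pairs)             # 2 rows, 10 columns (choose(5, 2) = 10 pairs)
pairs[, 1]             # "a" "b"
pairs[, ncol(pairs)]   # "d" "e"
```

The pairs come out in the same order as the column list in the question (a_b, a_c, ..., d_e), which is what makes it safe to build column names from them.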
For each feature pair we will multiply the corresponding columns of x. Reserve space for a new matrix that will hold the multiplied columns:
y <- matrix(NA, ncol = num_combinations, nrow = num_observations)
cn <- rep("?", num_combinations) # column names of new features
Multiply the column combinations:
for (i in 1:num_combinations)
{
cn[i] <- paste(combinations[1,i], combinations[2,i], sep = ".")
y[,i] <- x[,combinations[1,i]] * x[,combinations[2,i]]
}
colnames(y) <- cn
Finally, merge the original matrix and the additional features:
x <- cbind(x, y)
For simplicity this only handles multiplication; the additional features created by division are of course similar.
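As a rough sketch of what "similar" could look like for the ratios, the same loop can be repeated with `/` in place of `*`. The seed, the `ratio_` name prefix (taken from the question's desired column names) and the variables `r`, `rn` and `x_with_ratios` are assumptions made for this illustration, not part of the answer above.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
x <- replicate(5, sample(1:10, 20, replace = TRUE))
colnames(x) <- letters[1:5]

combinations <- combn(colnames(x), 2)
num_combinations <- ncol(combinations)

# Pre-allocate the ratio matrix, one column per feature pair
r <- matrix(NA_real_, nrow = nrow(x), ncol = num_combinations)
rn <- character(num_combinations)

for (i in seq_len(num_combinations)) {
  first <- combinations[1, i]
  second <- combinations[2, i]
  rn[i] <- paste("ratio", first, second, sep = "_")
  r[, i] <- x[, first] / x[, second]   # first column divided by second
}
colnames(r) <- rn

x_with_ratios <- cbind(x, r)
```

Because the sample values are drawn from 1:10, no division by zero can occur here; with real data the ratio columns may contain `Inf` or `NaN`.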
Update
A nice approach suggested by @nongkrong in the comments drops the explicit loop and simply does:
y <- combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] * cols[[2]])
x <- cbind(x, y)
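It doesn't explicitly set the column names of the new features, but it is more elegant and more readable. In some quick timings I did, it was also about 30% faster!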
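One way such a comparison might be set up is sketched below. The wrapper functions `loop_version` and `combn_version`, the seed, the matrix size and the repetition count are all assumptions for illustration; the timings will differ from machine to machine, so no particular speed-up is claimed here.

```r
set.seed(1)  # assumed seed
x <- replicate(5, sample(1:10, 1000, replace = TRUE))
colnames(x) <- letters[1:5]

# Explicit loop over all column pairs
loop_version <- function(x) {
  combinations <- combn(ncol(x), 2)
  y <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(combinations))
  for (i in seq_len(ncol(combinations))) {
    y[, i] <- x[, combinations[1, i]] * x[, combinations[2, i]]
  }
  y
}

# combn applies FUN to each pair of columns and binds the results
combn_version <- function(x) {
  combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] * cols[[2]])
}

# Both versions should agree before their timings are compared
stopifnot(isTRUE(all.equal(loop_version(x), combn_version(x),
                           check.attributes = FALSE)))

system.time(for (k in 1:200) loop_version(x))
system.time(for (k in 1:200) combn_version(x))
```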
Based on the answers from WhiteViking and bunk, here is the code with column names added:
set.seed(1)
x = replicate(5, rnorm(10))
colnames(x) = c('a','b','c','d','e')
mult <- combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] * cols[[2]])
colnames(mult) <-paste("cp",combn(colnames(x), 2L, paste, collapse = "_"),sep="_")
ratio <- combn(split(x, col(x)), 2, FUN = function(cols) cols[[1]] / cols[[2]])
colnames(ratio) <-paste("ratio",combn(colnames(x), 2L, paste, collapse = "_"),sep="_")
cbind(x,mult,ratio)
> cbind(x,mult,ratio)
a b c d e cp_a_b cp_a_c cp_a_d
[1,] -0.6265 1.51178 0.91898 1.35868 -0.1645 -0.947061 -0.57570 -0.85115
[2,] 0.1836 0.38984 0.78214 -0.10279 -0.2534 0.071592 0.14363 -0.01888
[3,] -0.8356 -0.62124 0.07456 0.38767 0.6970 0.519126 -0.06231 -0.32395
[4,] 1.5953 -2.21470 -1.98935 -0.05381 0.5567 -3.533068 -3.17357 -0.08583
[5,] 0.3295 1.12493 0.61983 -1.37706 -0.6888 0.370673 0.20424 -0.45375
[6,] -0.8205 -0.04493 -0.05613 -0.41499 -0.7075 0.036867 0.04605 0.34049
[7,] 0.4874 -0.01619 -0.15580 -0.39429 0.3646 -0.007892 -0.07594 -0.19219
[8,] 0.7383 0.94384 -1.47075 -0.05931 0.7685 0.696858 -1.08589 -0.04379
[9,] 0.5758 0.82122 -0.47815 1.10003 -0.1123 0.472844 -0.27531 0.63337
[10,] -0.3054 0.59390 0.41794 0.76318 0.8811 -0.181371 -0.12763 -0.23307
cp_a_e cp_b_c cp_b_d cp_b_e cp_c_d cp_c_e cp_d_e
[1,] 0.10307 1.389293 2.054026 -0.248724 1.24860 -0.15119 -0.22353
[2,] -0.04653 0.304911 -0.040071 -0.098771 -0.08039 -0.19816 0.02604
[3,] -0.58240 -0.046323 -0.240837 -0.432982 0.02891 0.05197 0.27019
[4,] 0.88803 4.405817 0.119162 -1.232842 0.10704 -1.10740 -0.02995
[5,] -0.22695 0.697261 -1.549097 -0.774803 -0.85354 -0.42691 0.94846
[6,] 0.58048 0.002522 0.018647 0.031790 0.02329 0.03971 0.29361
[7,] 0.17771 0.002522 0.006384 -0.005903 0.06143 -0.05680 -0.14375
[8,] 0.56743 -1.388149 -0.055982 0.725369 0.08724 -1.13032 -0.04558
[9,] -0.06469 -0.392667 0.903364 -0.092261 -0.52598 0.05372 -0.12358
[10,] -0.26908 0.248216 0.453251 0.523291 0.31896 0.36825 0.67244
ratio_a_b ratio_a_c ratio_a_d ratio_a_e ratio_b_c ratio_b_d ratio_b_e
[1,] -0.4144 -0.6817 -0.4611 3.8077 1.6451 1.11268 -9.18884
[2,] 0.4711 0.2348 -1.7866 -0.7248 0.4984 -3.79270 -1.53868
[3,] 1.3451 -11.2067 -2.1555 -1.1990 -8.3315 -1.60249 -0.89135
[4,] -0.7203 -0.8019 -29.6493 2.8658 1.1133 41.16157 -3.97853
[5,] 0.2929 0.5316 -0.2393 -0.4784 1.8149 -0.81691 -1.63328
[6,] 18.2596 14.6176 1.9771 1.1597 0.8005 0.10828 0.06351
[7,] -30.1063 -3.1286 -1.2362 1.3370 0.1039 0.04106 -0.04441
[8,] 0.7823 -0.5020 -12.4479 0.9607 -0.6417 -15.91270 1.22810
[9,] 0.7011 -1.2042 0.5234 -5.1251 -1.7175 0.74655 -7.30974
[10,] -0.5142 -0.7307 -0.4002 -0.3466 1.4210 0.77820 0.67404
ratio_c_d ratio_c_e ratio_d_e
[1,] 0.6764 -5.58569 -8.25827
[2,] -7.6092 -3.08703 0.40570
[3,] 0.1923 0.10699 0.55623
[4,] 36.9733 -3.57371 -0.09666
[5,] -0.4501 -0.89992 1.99934
[6,] 0.1353 0.07933 0.58657
[7,] 0.3951 -0.42733 -1.08149
[8,] 24.7963 -1.91371 -0.07718
[9,] -0.4347 4.25604 -9.79139
[10,] 0.5476 0.47434 0.86615
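Since the question asks for a matrix calculation rather than a loop, another option is to index the matrix by the pair indices and do the arithmetic on whole matrices at once. This is only a sketch of that idea, not code from any of the answers; the names `idx`, `first`, `second`, `pair_names` and `result` are assumptions of mine.

```r
set.seed(1)
x <- replicate(5, rnorm(10))
colnames(x) <- c("a", "b", "c", "d", "e")

idx <- combn(ncol(x), 2)   # 2 x 10 matrix of column indices
first <- idx[1, ]
second <- idx[2, ]

# Whole-matrix arithmetic: column k of each result belongs to pair k
mult <- x[, first] * x[, second]
ratio <- x[, first] / x[, second]

pair_names <- paste(colnames(x)[first], colnames(x)[second], sep = "_")
colnames(mult) <- paste("cp", pair_names, sep = "_")
colnames(ratio) <- paste("ratio", pair_names, sep = "_")

result <- cbind(x, mult, ratio)
dim(result)   # 10 rows, 25 columns (5 original + 10 cp + 10 ratio)
```

The trade-off is memory: `x[, first]` and `x[, second]` each materialise a copy with one column per pair, which should be fine for a handful of columns but grows quadratically with the number of features.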
Here is a dplyr/tidyr answer:
library(dplyr)
library(tidyr)
wide_data =
x %>%
as.data.frame %>%
mutate(row = 1:n())
# prepend `prefix` and an underscore to every column name
prefix = function(dataframe, prefix)
dataframe %>%
setNames(names(.) %>% paste(prefix, . , sep = "_"))
long_data =
wide_data %>%
gather(column, value, -row)
long_data %>% prefix("first") %>%
merge(long_data %>% prefix("second")) %>%
mutate(product = first_value * second_value,
ratio = second_value / first_value) %>%
select(-first_value, -second_value) %>%
gather(measure, value, product, ratio) %>%
unite(new_column, measure, first_column, second_column, sep = "_") %>%
spread(new_column, value) %>%
left_join(wide_data %>% prefix("first")) %>%
left_join(wide_data %>% prefix("second"))