Is there any sparse support for the dist function in R?
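Has anyone heard of a package or function that works the same as the dist {stats} function, which creates a
distance matrix that is computed by using the specified distance measure to compute the distances between the rows of a data matrix,
but takes a sparse matrix as input?
My data.frame (named dataCluster) has dimensions 7000 x 10000 and is almost 99% sparse. In regular, non-sparse form, this call never seems to finish...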
h1 <- hclust( dist( dataCluster ) , method = "complete" )
A similar question with no answers: Sparse Matrix as input to Hierarchical clustering in R
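Try wordspace::dist.matrix. It accepts sparse matrices from the Matrix package (which is not clear from the documentation) and can also compute cross-distances, output both Matrix and dist objects, and more.
The default distance measure is 'cosine', though, so be sure to specify method = 'euclidean' if that is what you want.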
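If an extra package is not an option, the Euclidean case can also be sketched with the Matrix package alone, using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. The helper name sparse_euclid and the small random matrix are made up for illustration (they stand in for dataCluster), and this is only a rough sketch, not what wordspace does internally. Note that the resulting distance matrix is dense regardless (7000 x 7000 for the data in the question).

```r
library(Matrix)

# Hypothetical helper (not from any package): Euclidean distances between
# the rows of a sparse matrix via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
sparse_euclid <- function(x) {
  x <- as(x, "CsparseMatrix")
  sq <- rowSums(x^2)                    # squared row norms
  gram <- as.matrix(tcrossprod(x))      # row-by-row dot products (dense n x n)
  d2 <- outer(sq, sq, "+") - 2 * gram   # squared distances
  d2[d2 < 0] <- 0                       # guard against round-off
  as.dist(sqrt(d2))
}

set.seed(1)
m <- rsparsematrix(20, 50, density = 0.05)  # small stand-in for dataCluster
d_sparse <- sparse_euclid(m)
d_dense  <- dist(as.matrix(m))              # dense reference for comparison
h <- hclust(d_sparse, method = "complete")
```

The sparse input is never converted to a dense 20 x 50 matrix inside the helper; only the n x n result is dense, which hclust needs anyway.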
**Update:** In fact, you can quite easily do what qlcMatrix does yourself:
library(Matrix)

# Cosine similarity between the columns of x, or between the columns of
# x and the columns of y when y is supplied
sparse.cos <- function(x, y = NULL, drop = TRUE) {
  if (!is.null(y)) {
    if (!inherits(x, "dgCMatrix") || !inherits(y, "dgCMatrix")) stop("x and y must both be dgCMatrix")
    # drop dimnames so they are not carried into the result
    if (drop) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
    # scale every column to unit length, then take the cross-product
    crossprod(
      tcrossprod(
        x,
        Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
      ),
      tcrossprod(
        y,
        Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, nrow(y)))) ^ -0.5)
      )
    )
  } else {
    if (!inherits(x, "dgCMatrix")) stop("x must be a dgCMatrix")
    if (drop) colnames(x) <- rownames(x) <- NULL
    crossprod(
      tcrossprod(
        x,
        Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
      )
    )
  }
}
I found no significant difference in performance between the above and qlcMatrix::cosSparse.
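To see why the diagonal scaling in the function above yields cosine similarities, here is a small check against a dense reference. It is a sketch using only the Matrix package; the matrix size and the variable names are arbitrary choices for illustration.

```r
library(Matrix)

set.seed(42)
x <- rsparsematrix(30, 8, density = 0.5)   # 30 rows, 8 columns

# Scale each column to unit length with a diagonal matrix, then take the
# cross-product; the result holds the cosine similarity of every column pair
col_norms <- sqrt(colSums(x^2))
stopifnot(all(col_norms > 0))              # an all-zero column has no cosine
x_unit <- x %*% Diagonal(x = 1 / col_norms)
cos_sparse <- as.matrix(crossprod(x_unit)) # 8 x 8 similarity matrix

# Dense reference computed directly from the definition
xd <- as.matrix(x)
cos_dense <- crossprod(xd) / tcrossprod(sqrt(colSums(xd^2)))

max(abs(cos_sparse - cos_dense))           # difference is round-off only
```

The all-zero-column guard matters: a zero norm would put Inf on the diagonal and produce NaN similarities.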
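qlcMatrix::cosSparse is faster than wordspace::dist.matrix when the data is >50% sparse or when the similarity is computed along the longest edge of the input matrix (i.e. tall format).
Performance of wordspace::dist.matrix versus qlcMatrix::cosSparse when computing a 1000 x 1000 similarity matrix from a wide matrix (1000 x 5000) at varying sparsity (10%, 50%, 90% or 99% sparse):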
library(Matrix)      # rsparsematrix
library(qlcMatrix)   # cosSparse
library(wordspace)   # dist.matrix
library(rbenchmark)  # benchmark (assumed; the call below matches its signature)

# M10 is 10% sparse, M99 is 99% sparse (sparsity = 1 - density)
set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 0.9)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
# cosSparse compares columns, dist.matrix(byrow = TRUE) compares rows,
# so wordspace gets the transposed matrices
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
  "cosSparse: 10% sparse" = cosSparse(M10),
  "cosSparse: 50% sparse" = cosSparse(M50),
  "cosSparse: 90% sparse" = cosSparse(M90),
  "cosSparse: 99% sparse" = cosSparse(M99),
  "wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
  "wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
  "wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
  "wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
  replications = 2, columns = c("test", "elapsed", "relative"))
The two functions are quite comparable, with wordspace taking a slight lead at low sparsity, but definitely not at high sparsity:
                   test elapsed relative
1 cosSparse: 10% sparse   15.83  527.667
2 cosSparse: 50% sparse    4.72  157.333
3 cosSparse: 90% sparse    0.31   10.333
4 cosSparse: 99% sparse    0.03    1.000
5 wordspace: 10% sparse   15.23  507.667
6 wordspace: 50% sparse    4.28  142.667
7 wordspace: 90% sparse    0.36   12.000
8 wordspace: 99% sparse    0.09    3.000
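A note on the labels: "x% sparse" here means that x% of the entries are zero, so it corresponds to 1 minus the density argument of rsparsematrix. A quick sketch to check that relationship on a small matrix (the size is arbitrary):

```r
library(Matrix)

set.seed(123)
m <- rsparsematrix(100, 50, density = 0.1)

# Sparsity is the share of zero entries, i.e. 1 - density
sparsity <- 1 - nnzero(m) / prod(dim(m))
sparsity
```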
If we flip the calculation to compute a 5000 x 5000 matrix, then:
benchmark(
"cosSparse: 50% sparse" = cosSparse(tM50),
"cosSparse: 90% sparse" = cosSparse(tM90),
"cosSparse: 99% sparse" = cosSparse(tM99),
"wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
replications = 1, columns = c("test", "elapsed", "relative"))
Now the competitive advantage of cosSparse becomes very clear:
                   test elapsed relative
1 cosSparse: 50% sparse   10.58  151.143
2 cosSparse: 90% sparse    1.44   20.571
3 cosSparse: 99% sparse    0.07    1.000
4 wordspace: 50% sparse   11.41  163.000
5 wordspace: 90% sparse    2.39   34.143
6 wordspace: 99% sparse    0.64    9.143
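At 50% sparsity the difference in efficiency is not very pronounced, but at 90% sparsity wordspace is about 1.6x slower, and at 99% sparsity it is nearly 10x slower!
Compare this performance with a square matrix: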
M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)
benchmark(
"cosSparse: square, 50% sparse" = cosSparse(M50.square),
"wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
"cosSparse: square, 90% sparse" = cosSparse(M90.square),
"wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
replications = 5, columns = c("test", "elapsed", "relative"))
cosSparse is marginally faster at 50% sparsity and almost twice as fast at 90% sparsity!
                           test elapsed relative
1 cosSparse: square, 50% sparse    2.12    9.217
3 cosSparse: square, 90% sparse    0.23    1.000
2 wordspace: square, 50% sparse    2.15    9.348
4 wordspace: square, 90% sparse    0.40    1.739
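Note that wordspace::dist.matrix has more edge-case checks than qlcMatrix::cosSparse and also allows parallelization through openmp in R. In addition, wordspace::dist.matrix supports Euclidean and Jaccard distance measures, although these are far slower. There are a number of other handy functions built into that package.
That said, if you only need cosine similarity, your matrix is >50% sparse, and you are calculating the tall way, cosSparse should be the tool of choice.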
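To tie this back to the hclust call in the question: a cosine similarity matrix has to be turned into a dissimilarity before clustering. Below is a minimal sketch with the Matrix package only. The random matrix stands in for dataCluster, 1 - similarity is one common choice of cosine distance (an assumption here, not something the benchmarks above prescribe), and k = 4 is arbitrary. With cosSparse the matrix would have to be transposed first, since it compares columns.

```r
library(Matrix)

set.seed(7)
# Small stand-in for dataCluster: 40 observations (rows), 200 features
m <- rsparsematrix(40, 200, density = 0.1)

# Cosine similarity between rows: scale rows to unit length, then tcrossprod
row_norms <- sqrt(rowSums(m^2))
stopifnot(all(row_norms > 0))           # all-zero rows have no cosine
m_unit <- Diagonal(x = 1 / row_norms) %*% m
sim <- as.matrix(tcrossprod(m_unit))    # 40 x 40 similarity matrix

# Turn similarity into a dissimilarity and cluster
d <- as.dist(1 - sim)
h <- hclust(d, method = "complete")
groups <- cutree(h, k = 4)              # cluster label per observation
```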