How to cluster around 9000 sequences of numbers using R?
I have a csv file with around 9000 sequences of numbers that I need to cluster. The first 6 lines of the csv look like this:
id, sequence
"1","1 2"
"2","3 4 5 5 6 6 7 8 9 10 11 12 13 8 14 10 10 15 11 12 16"
"3","17 18 19 20 5 5 20 5 5"
"4","20 21"
"5","22 4 23 24 25 26"
My R code to perform the clustering is shown below:
seqsim <- function(seq1, seq2){
  seq1 <- as.character(seq1)
  seq2 <- as.character(seq2)
  s1 <- get1grams(seq1)
  s2 <- get1grams(seq2)
  intersection <- intersect(s1, s2)
  if (length(intersection) == 0) {
    return(1)
  } else {
    u <- union(s1, s2)
    score <- length(intersection) / length(u)
    return(1 - score)
  }
}
###############
mydata <- read.csv("sequence.csv")
mydatamatrix <- as.matrix(mydata$sequence)
# take the data in csv and create dist matrix
rownames(mydatamatrix) <- mydata$id
distance_matrix <- dist_make(mydatamatrix, seqsim, "SeqSim (custom)")
clusters <- hclust(distance_matrix, method = "complete")
plot(clusters)
clusterCut <- cutree(clusters, h=0.5)
# clustercut contains the clusterIDs assigned to each sequence or row of the input dataset
# Number of members in each cluster
table(mydata$id,clusterCut)
write.csv(clusterCut, file = "clusterIDs.csv")
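The code works for a small number of sequences, around 900 or so, but I run into memory issues with larger datasets.
My question is: am I doing the clustering the right way? Is there a faster and more memory-efficient way to cluster this kind of data using R?
The function seqsim actually returns a distance rather than a similarity, since I return 1 - score. Seqsim calls other methods that I have omitted to reduce the length of the code.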
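To make the distance concrete, here is a minimal worked example on rows 3 and 4 of the sample data. Since get1grams is not shown above, the stand-in below (split on whitespace, keep the distinct tokens) is only an assumption for illustration and may differ from the real helper.

```r
# Hypothetical stand-in for the omitted get1grams():
# split a sequence string on whitespace and keep the distinct tokens.
get1grams <- function(s) unique(strsplit(trimws(s), "\\s+")[[1]])

s3 <- get1grams("17 18 19 20 5 5 20 5 5")  # distinct tokens: 17 18 19 20 5
s4 <- get1grams("20 21")                   # distinct tokens: 20 21

shared <- intersect(s3, s4)                # only "20" is shared
all_tokens <- union(s3, s4)                # 6 distinct tokens in total

# Jaccard distance = 1 - |intersection| / |union| = 1 - 1/6
jaccard_dist <- 1 - length(shared) / length(all_tokens)
jaccard_dist
```

Under that assumption the two sequences share one token out of six, so the distance is about 0.83, which is above the h=0.5 cut used in the clustering code.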
I suspect/assume the bottleneck is the distance calculation rather than the clustering itself.
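Here is how I would approach it:
- Separate the text processing from the distance calculation (this avoids processing each string multiple times).
- Use R's dist function, or matrix operations, to compute the distance matrix (i.e. the Jaccard index).
- Be careful about trying to plot the clustering result for 9000 sequences; it will almost certainly be unreadable.
- A 9000 x 9000 matrix will need a lot of memory, so that may be the next bottleneck you need to overcome, depending on your machine's memory resources.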
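As a rough back-of-the-envelope check on the memory point, the sketch below estimates the size of the distance data for 9000 sequences, assuming 8-byte doubles and ignoring R's own overhead and temporary copies.

```r
n <- 9000

# hclust() takes a "dist" object, which stores only the lower triangle:
# n * (n - 1) / 2 doubles at 8 bytes each.
n_pairs <- n * (n - 1) / 2          # 40,495,500 pairs
dist_mb <- n_pairs * 8 / 1024^2     # roughly 309 MiB

# A full square n x n double matrix costs about twice that.
full_mb <- n^2 * 8 / 1024^2         # roughly 618 MiB

round(c(dist_object = dist_mb, full_matrix = full_mb))
```

The matrix-multiplication route holds several full square matrices at once, so its peak usage is likely a multiple of the second figure.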
Code:
library(arules)
df <- read.table(text='id, sequence
"1","1 2"
"2","3 4 5 5 6 6 7 8 9 10 11 12 13 8 14 10 10 15 11 12 16"
"3","17 18 19 20 5 5 20 5 5"
"4","20 21"
"5","22 4 23 24 25 26"', header=TRUE, sep=",")
seq <- lapply(df$sequence, get1grams) # assuming get1grams returns a vector of 1-grams
names(seq) <- paste0("seq_", df$id)
seqTrans <- as(seq, "transactions") # create a transactions object
# Turn the transactions object into an incidence matrix:
# each row is a sequence, each column a 1-gram, each cell presence/absence of that 1-gram
seqMat <- as(seqTrans, "matrix")
seqMat <- +(seqMat) # convert logical to 0/1
# Option 1: base R's dist(); method = "binary" gives the Jaccard distance
j.dist <- dist(seqMat, method = "binary")
# Option 2: matrix multiplication to calculate the Jaccard distance
tseqMat <- t(seqMat)
a <- t(tseqMat) %*% tseqMat # 1-grams shared by each pair of sequences
b <- t(matrix(rep(1, length(tseqMat)), nrow = nrow(tseqMat), ncol = ncol(tseqMat))) %*% tseqMat
b <- b - a # 1-grams present in only one sequence of the pair
c <- t(b) # 1-grams present only in the other sequence
j <- as.dist(1-a/(a+b+c))
clusters <- hclust(j, method = "complete")
plot(clusters)
clusterCut <- cutree(clusters, h=0.5)
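If the dense incidence matrix and the helper matrices above become a problem, one possible variation is to keep the incidence matrix sparse with the Matrix package. This is only a sketch: the strsplit-based tokenising is a hypothetical stand-in for get1grams, and the pairwise result is still a full n x n object, so it does not remove the final memory cost of the distance matrix.

```r
library(Matrix)  # recommended package that ships with standard R installs

# Sample data from the question
sequences <- c("1 2",
               "3 4 5 5 6 6 7 8 9 10 11 12 13 8 14 10 10 15 11 12 16",
               "17 18 19 20 5 5 20 5 5",
               "20 21",
               "22 4 23 24 25 26")

# Tokenise each sequence once (hypothetical stand-in for get1grams)
tokens <- lapply(strsplit(sequences, "\\s+"), unique)
vocab <- unique(unlist(tokens))

# Sparse 0/1 incidence matrix: rows = sequences, columns = distinct tokens
inc <- sparseMatrix(i = rep(seq_along(tokens), lengths(tokens)),
                    j = match(unlist(tokens), vocab),
                    x = 1,
                    dims = c(length(tokens), length(vocab)))

inter <- as.matrix(tcrossprod(inc))       # shared token count for every pair
sizes <- diag(inter)                      # distinct token count per sequence
uni <- outer(sizes, sizes, "+") - inter   # union size = |A| + |B| - shared
j <- as.dist(1 - inter / uni)             # Jaccard distance

clusters <- hclust(j, method = "complete")
clusterCut <- cutree(clusters, h = 0.5)
table(clusterCut)                         # number of sequences per cluster
```

Using table(clusterCut) on its own gives the cluster sizes directly, which is usually easier to read than a dendrogram at this scale.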