Cluster sequences of strings in R
I have the following data:
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red","black-white","black-white-purple")
df <- data.frame(attributes)
df
attributes
1 apple-water-orange
2 apple-water
3 apple-orange
4 coffee
5 coffee-croissant
6 green-red-yellow
7 green-red-blue
8 green-red
9 black-white
10 black-white-purple
What I would like is another column that assigns a category to each row based on how similar the observations are.
category <- c(1,1,1,2,2,3,3,3,4,4)
df <- data.frame(attributes, category)
attributes category
1 apple-water-orange 1
2 apple-water 1
3 apple-orange 1
4 coffee 2
5 coffee-croissant 2
6 green-red-yellow 3
7 green-red-blue 3
8 green-red 3
9 black-white 4
10 black-white-purple 4
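This is clustering in a broad sense, but I think most clustering methods only work on numeric data, and one-hot encoding has a lot of disadvantages (that's what I read on the internet).
Does anyone have an idea how to do this? Maybe some word-matching approach?
It would also be great if I could adjust the degree of similarity (rough vs. fine-grained "clustering") with a parameter.
Thanks in advance for any ideas!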
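So there are two possibilities that come to mind.
Option 1: use one-hot encoding. It is simple and straightforward, as long as you are fine with apple/apples being just as different as apple/orange. I use the Jaccard index as the distance measure because it does a reasonably good job with overlapping sets.
Option 2: use a local sequence alignment algorithm. It should be quite robust against things like apple/apples vs. apple/orange. It also has more tuning parameters, which could take time to optimize for your problem.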
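To make the caveat in Option 1 concrete, here is a minimal base-R sketch of the Jaccard distance between two attribute strings, each treated as a set of words. The helper name `jaccard_dist` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: Jaccard distance between two "-"-separated strings,
# where each string is treated as a set of words.
jaccard_dist <- function(a, b, sep = "-") {
  x <- unique(strsplit(a, sep, fixed = TRUE)[[1]])
  y <- unique(strsplit(b, sep, fixed = TRUE)[[1]])
  1 - length(intersect(x, y)) / length(union(x, y))
}

# Two of three distinct words are shared, so the distance is 1 - 2/3
jaccard_dist("apple-water-orange", "apple-water")

# Words are compared as whole tokens, so both pairs are maximally distant (1)
jaccard_dist("apple", "apples")
jaccard_dist("apple", "orange")
```

Because words only match exactly, near-duplicates such as apple/apples get no credit under this encoding, which is the trade-off mentioned in Option 1.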
library(reshape2)  # dcast()
library(proxy)     # dist() with method = "jaccard"
library(gplots)    # heatmap.2()

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")
dat <- data.frame(attr = attributes, row.names = paste0("id", seq_along(attributes)))

# Split each string into words and build a long table of (id, word) pairs
attributesList <- strsplit(attributes, "-")
df <- data.frame(id = paste0("id", rep(seq_along(attributesList), lengths(attributesList))),
                 word = unlist(attributesList))

# One-hot encode: one row per word, one column per id
df.wide <- dcast(data = df, word ~ id, length)
rownames(df.wide) <- df.wide[, 1]
df.wide <- as.matrix(df.wide[, -1])

# Jaccard distance between ids, then hierarchical clustering
df.dist <- dist(t(df.wide), method = "jaccard")
hc <- hclust(df.dist)
plot(hc)
abline(h = c(0.6, 0.8))  # the two cut heights used below
heatmap.2(df.wide, trace = "none", col = rev(heat.colors(15)))

# Cut the tree at two heights: cat1 is the coarser grouping, cat2 the finer one
res <- merge(dat, data.frame(cat1 = cutree(hc, h = 0.8)), by = "row.names")
res <- merge(res, data.frame(cat2 = cutree(hc, h = 0.6)), by.y = "row.names", by.x = "Row.names")
res
You will see that you can control the granularity of the categorization by adjusting where you cut the dendrogram.
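If it helps to see the effect of the cut height in isolation, here is a small sketch that uses base R only. It builds the Jaccard distance matrix by hand instead of using proxy, so it is an approximation of the approach above rather than the same code; the names `jac`, `coarse`, `fine` and `by_k` are just illustrative.

```r
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")

# Word sets and their pairwise Jaccard distances
words <- strsplit(attributes, "-", fixed = TRUE)
n <- length(words)
jac <- matrix(0, n, n, dimnames = list(attributes, attributes))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    jac[i, j] <- 1 - length(intersect(words[[i]], words[[j]])) /
      length(union(words[[i]], words[[j]]))
  }
}

hc <- hclust(as.dist(jac))     # default method is complete linkage

coarse <- cutree(hc, h = 0.8)  # higher cut: fewer, larger groups
fine   <- cutree(hc, h = 0.6)  # lower cut: more, smaller groups
by_k   <- cutree(hc, k = 4)    # alternatively, ask for a number of groups

data.frame(attributes, coarse, fine, by_k, row.names = NULL)
```

On this data the cut at 0.8 gives four groups and the cut at 0.6 gives five, because "apple-water" and "apple-orange" only share one of three words and so join at a height of 2/3. Using `k` instead of `h` is convenient when you already know how many categories you want.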
Here is the approach using Smith-Waterman (local) alignment.
Biostrings is part of the Bioconductor project. The SW algorithm finds the optimal local (non-end-to-end) alignment of two sequences (strings). In this case you can again use cutree to set your categories, but you can also tune the scoring function to suit your needs.
library(Biostrings)
library(gplots)  # heatmap.2()

strList <- lapply(attributes, BString)

# Pairwise local-alignment scores (a similarity: higher means more alike)
swDist <- matrix(apply(expand.grid(seq_along(strList), seq_along(strList)), 1, function(x) {
  pairwiseAlignment(strList[[x[1]]], strList[[x[2]]], type = "local")@score
}), nrow = length(strList))

heatmap.2(swDist, trace = "none", col = rev(heat.colors(15)),
          labRow = paste0("id", seq_along(strList)), labCol = paste0("id", seq_along(strList)))
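For anyone who wants to see what the local alignment score is doing without installing Bioconductor, here is a rough base-R sketch. The function `sw_score` is a hypothetical helper with an assumed scoring scheme (match +1, mismatch -1, linear gap -1); these are illustrative values, not the defaults of pairwiseAlignment, so the numbers will differ from the Biostrings output. The normalization used to turn scores into distances is also just one possible choice.

```r
# Hypothetical helper: Smith-Waterman local alignment score with linear gaps.
# The scoring values are assumptions chosen for illustration.
sw_score <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  H <- matrix(0, length(x) + 1, length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      s <- if (x[i] == y[j]) match else mismatch
      H[i + 1, j + 1] <- max(0,                  # start a new local alignment
                             H[i, j] + s,        # align x[i] with y[j]
                             H[i, j + 1] + gap,  # x[i] aligned to a gap
                             H[i + 1, j] + gap)  # y[j] aligned to a gap
    }
  }
  max(H)  # best score of any local alignment
}

attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee",
                "coffee-croissant", "green-red-yellow", "green-red-blue",
                "green-red", "black-white", "black-white-purple")

# Pairwise similarity scores
idx <- seq_along(attributes)
scores <- outer(idx, idx,
                Vectorize(function(i, j) sw_score(attributes[i], attributes[j])))

# One way to get a distance in [0, 1]: scale by the smaller self-score
self <- diag(scores)
sw_dist <- 1 - scores / outer(self, self, pmin)

hc <- hclust(as.dist(sw_dist))
groups <- cutree(hc, k = 4)
data.frame(attributes, groups, row.names = NULL)
```

Unlike the one-hot approach, apple/apples scores much higher here than apple/orange, since the alignment works on characters rather than whole words. Changing `match`, `mismatch` and `gap` is the kind of tuning the answer refers to.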