R cast can't deal with unique rows
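Question
I have cluster.id values, and for each cluster.id I found different letters in the corresponding cluster (this is a simplification).
I am interested in which letters are commonly found together across clusters (I used existing code for this), but I am not interested in the proportion in which each letter occurs, so I want to remove the duplicated rows (see the code below).
This seems to work (no errors), but the cast matrix is full of NA values and strings instead of the counts I wanted (I explain everything further in the comments of the code below).
Any suggestions on how to fix this, or is it simply not possible after filtering for unique rows?
Code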
library(dplyr)     # %>%, group_by, mutate, filter, select
library(reshape2)  # acast

test.set <- read.table(text = "
cluster.id letters
1 4 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = TRUE, stringsAsFactors = FALSE)
# remove irrelevant clusters (clusters which only contain 1 letter)
test.set <- test.set %>%
  group_by(cluster.id) %>%
  mutate(n.letters = n_distinct(letters)) %>%
  filter(n.letters > 1) %>%
  ungroup() %>%
  select(-n.letters)
test.set
# cluster.id letters
#<int> <chr>
#1 4 A
#2 4 B
#3 4 B
#4 3 A
#5 3 E
#6 3 D
#7 3 C
#8 2 A
#9 2 E
# I don't want duplicated rows because they are misleading.
# I'm only interested in which letters are found together in a
# cluster, not in what proportions.
# Therefore I want to remove these duplicated rows.
test.set.unique <- test.set %>% unique()
matrix <- acast(test.set.unique, cluster.id ~ letters)
matrix
# A B C D E
#2 "A" NA NA NA "E"
#3 "A" NA "C" "D" "E"
#4 "A" "B" NA NA NA
# This matrix contains NA values and letters instead of the counts I wanted.
# However using the matrix before filtering for unique rows works fine
matrix <- acast(test.set, cluster.id ~ letters)
matrix
# A B C D E
#2 1 0 0 0 1
#3 1 0 1 1 1
#4 1 2 0 0 0
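Answer
If we also look at the messages, there is one printed above the output of the call on the data that still has duplicates:
Aggregation function missing: defaulting to length
To get similar output from the deduplicated data, specify fun.aggregate: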
acast(test.set.unique, cluster.id ~ letters, length)
# A B C D E
#2 1 0 0 0 1
#3 1 0 1 1 1
#4 1 1 0 0 0
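The same call can be written with every argument named, so that nothing is left for acast to guess. This is only a sketch: the data frame below rebuilds the deduplicated data from the question by hand, and passing value.var = "letters" is an assumption about which column should be counted.

```r
library(reshape2)

# Hand-built copy of the deduplicated data from the question
test.set.unique <- data.frame(
  cluster.id = c(4, 4, 3, 3, 3, 3, 2, 2),
  letters    = c("A", "B", "A", "E", "D", "C", "A", "E"),
  stringsAsFactors = FALSE
)

# Name the value column and the aggregation function explicitly.
# length() counts the rows that fall into each cluster.id/letters cell.
counts <- acast(test.set.unique, cluster.id ~ letters,
                value.var = "letters", fun.aggregate = length)
counts
```

Because every row is unique here, each cell can only hold 0 or 1, so the result works as a presence/absence matrix.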
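When the data contain duplicated cluster.id/letters combinations and fun.aggregate is not specified, acast falls back to length as the aggregation function. With unique rows no aggregation is needed, so without fun.aggregate it picks a value.var column (here letters) and fills the matrix with that column's values, which gives the output shown in the OP's post.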