Count the appearances of a string and the associated results in the rows above
I have a data frame like this:
df <- data.frame(value = c("a","b","b","d","a","b","b","d","a","b","c","d"),
pattern = c("NA","a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc"))
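The value column holds the actual action, and the pattern column shows the accumulated actions before that action took place.
Now I want to compare each pattern with the 4 patterns above it and count how often it appears there, and also count which letters in the "value" column belong to those matches, in order to compute an expected result.
The result should look like this: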
value pattern appearance a b c d exp.result
1 a NA 0 0 0 0 0 <NA>
2 b a 0 0 0 0 0 <NA>
3 b ab 0 0 0 0 0 <NA>
4 d abb 0 0 0 0 0 <NA>
5 a bbd 0 0 0 0 0 <NA>
6 b bda 0 0 0 0 0 <NA>
7 b dab 0 0 0 0 0 <NA>
8 d abb 1 0 0 0 1 d
9 a bbd 1 1 0 0 0 a
10 b bda 1 0 1 0 0 b
11 c dab 1 0 1 0 0 b
12 d abc 0 0 0 0 0 <NA>
I hope someone can help me with this.
The function rollapply in the package zoo might help.
Define your original data.frame and load the package:
library(zoo)
df <- data.frame(value = c("a","b","b","d","a","b",
"b","d","a","b","c","d"),
pattern = c("NA","a","ab","abb","bbd","bda",
"dab","abb","bbd","bda","dab","abc"))
Define a function that returns how many times the fifth element appears among the first four:
f <- function(x) sum(x[5] == x[1:4])
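As a quick check of what f computes, here is a small sketch on two hand-made windows of length five. The windows are copied from the example data (rows 4 to 8 of pattern, rows 2 to 6 of value), and nothing here depends on zoo.

```r
# f counts how often the fifth element of a window occurs among the first four
f <- function(x) sum(x[5] == x[1:4])

# window ending at row 8 of the pattern column: "abb" was seen once before
pattern_window <- c("abb", "bbd", "bda", "dab", "abb")
n_pattern <- f(pattern_window)

# window ending at row 6 of the value column: "b" occurs twice among b, b, d, a
value_window <- c("b", "b", "d", "a", "b")
n_value <- f(value_window)
```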
Apply this function using rollapply:
df$appearance <- rollapply(df$pattern, 5, f, align = 'right', fill = NA)
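I'm not sure I've interpreted your letter columns correctly, but you can apply the same (or a similar) function to the individual letters and then split the resulting column into 4 based on the value column.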
df$letters <- rollapply(df$value, 5, f, align = 'right', fill = NA)
df$a <- 0
df$a[df$value == 'a'] <- df$letters[df$value == 'a']
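The lines above only fill the column for 'a'. One way to repeat the same step for all four letters is a loop. This is only a sketch: to keep it self-contained, the letters column is rebuilt with a base R stand-in (the hypothetical helper count_prev4), which is meant to mimic the rollapply call above, including the NA padding for the first four rows.

```r
df <- data.frame(value = c("a","b","b","d","a","b","b","d","a","b","c","d"),
                 stringsAsFactors = FALSE)

# hypothetical stand-in for rollapply(df$value, 5, f, align = 'right', fill = NA)
count_prev4 <- function(x) {
  vapply(seq_along(x), function(i) {
    if (i < 5) return(NA_integer_)      # not enough rows above: pad with NA
    sum(x[i] == x[(i - 4):(i - 1)])     # matches among the four rows above
  }, integer(1))
}
df$letters <- count_prev4(df$value)

# one column per letter: the count where value is that letter, 0 elsewhere
for (l in c("a", "b", "c", "d")) {
  df[[l]] <- 0
  sel <- df$value == l
  df[[l]][sel] <- df$letters[sel]
}
```

The first four rows keep NA in the column of their own letter, because the window is not yet full there.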
How you handle the NA values at the start is up to you.
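If you would rather have 0 than NA in the first rows, one option is to count over however many rows are available. The sketch below does this in base R without zoo; count_in_previous and its argument k are names made up for this illustration.

```r
pattern <- c("NA","a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc")

# hypothetical helper: how often x[i] occurs among the (at most) k elements before it
count_in_previous <- function(x, k = 4) {
  vapply(seq_along(x), function(i) {
    prev <- x[seq_len(i - 1)]        # everything before position i
    prev <- tail(prev, k)            # keep at most the last k of them
    sum(prev == x[i], na.rm = TRUE)  # NA comparisons count as "no match"
  }, integer(1))
}

appearance <- count_in_previous(pattern)
```

With this version the first row has an empty window and gets a count of 0, which matches the layout of the table in the question.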
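If I may hazard a guess, it looks like you are working with DNA codons. In case you haven't already, you may want to look at existing packages. Bioconductor in particular has many useful ones for working with biological data.
You can use this approach: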
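df <- data.frame(
  value = c("a","b","b","d","a","b","b","d","a","b","c","d"),
  pattern = c(NA,"a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc"),
  stringsAsFactors = TRUE)  # "value" must be a factor (see the note below)
win <- 4  # number of preceding rows to compare against
analyzeWindow <- function(idx){
  # indices of up to `win` rows directly above row idx (none for the first row)
  idxs <- max(1,idx-win):(idx-1)
  if(idx == 1) idxs <- integer()
  winDF <- df[idxs,]
  # keep only the window rows whose pattern equals the current one;
  # which() drops NA comparisons without shifting positions
  winDF <- winDF[which(winDF$pattern == df$pattern[idx]),]
  # count the values in the matching rows (one entry per factor level)
  expValWeights <- unlist(as.list(table(winDF$value)))
  c(appearances=nrow(winDF),expValWeights)
}
newCols <- t(sapply(1:nrow(df),analyzeWindow))
df2 <- cbind(df,newCols)
# most frequent value among the matches; NA where there was no match
df2$exp.result <- colnames(newCols)[-1][max.col(newCols[,-1],ties.method='first')]
df2$exp.result[rowSums(newCols[,-1]) == 0] <- NA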
> df2
value pattern appearances a b c d exp.result
1 a <NA> 0 0 0 0 0 <NA>
2 b a 0 0 0 0 0 <NA>
3 b ab 0 0 0 0 0 <NA>
4 d abb 0 0 0 0 0 <NA>
5 a bbd 0 0 0 0 0 <NA>
6 b bda 0 0 0 0 0 <NA>
7 b dab 0 0 0 0 0 <NA>
8 d abb 1 0 0 0 1 d
9 a bbd 1 1 0 0 0 a
10 b bda 1 0 1 0 0 b
11 c dab 1 0 1 0 0 b
12 d abc 0 0 0 0 0 <NA>
Note:
This code requires the "value" column to be of type factor. If it is not, convert it with as.factor.
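The reason, as far as I can tell, is how table() behaves: on a factor it reports a count for every level, zeros included, so every window yields a vector of the same length and sapply can bind the results into a matrix. A small sketch of the difference, using only the value column from the example:

```r
value <- c("a","b","b","d","a","b","b","d","a","b","c","d")
value_fct <- as.factor(value)    # levels: a, b, c, d

# a one-row "window" containing only the value "d"
tab_chr <- table(value[4])       # character: only the value that is present
tab_fct <- table(value_fct[4])   # factor: all four levels, zeros included

length(tab_chr)   # 1
length(tab_fct)   # 4

# an empty window still gives four zero counts for the factor
empty_fct <- table(value_fct[0])
```

With a character column the per-window vectors would differ in length, and sapply would return a list instead of a matrix.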