Finding Specific Vector Entries in a Sliding Window
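I am trying to write a function that returns the number of occurrences of a specific pair of adjacent nucleotides (a C immediately followed by a G) within a given window of a sequence that I have stored as a vector.
I would like the windows to be 100 nucleotides long and to shift by 10 nucleotides each time.
The data is set up like this (up to 10k entries):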
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
"a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
So far I have tried this:
library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))
But I always get the error
"Error in match.fun(FUN) :
'count(data, 2)' is not a function, character or symbol"
I have also tried:
starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
chunk <- data[starts[i]:(starts[i]+99)]
chunkCG <- count(chunk,wordsize=2)
print (chunkCG)
}
However, I don't know how to save the returned data. This approach also doesn't let me overlap the frames.
Your approach has no overlap because you call seq() with by = 100. Otherwise it looks fine. Just change it to 10.
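As a rough illustration of that change, here is a minimal base R sketch of the window boundaries with a step of 10. The names `seq_len_total`, `window_size`, and `step` are assumptions made for this example, and the upper limit `seq_len_total - window_size + 1` is chosen so that the last full window is included.

```r
seq_len_total <- 600   # assumed sequence length, for illustration only
window_size   <- 100   # nucleotides per window
step          <- 10    # shift between consecutive windows

# First and last position of every overlapping window
starts <- seq(1, seq_len_total - window_size + 1, by = step)
ends   <- starts + window_size - 1

# Inspect the first few window boundaries
head(cbind(starts, ends), 3)
```

Consecutive windows then share 90 positions, which is the overlap the question asks for.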
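To save the data from your last attempt, create a vector to collect the results; you can then pull the right count out of each table by indexing with its name. The counts are numbers, so a numeric vector with one slot per window works well:
counted_cg <- numeric(n)  # one slot per window
for (i in 1:n){
  chunk <- data[starts[i]:(starts[i]+99)]
  chunkCG <- count(chunk, wordsize=2)
  counted_cg[i] <- chunkCG["cg"]  # pick the "cg" entry by name
}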
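The same collect-per-window idea can be sketched in base R without seqinr. This is only an illustration under stated assumptions: `count_cg` is a hypothetical helper (not a seqinr function), and `seq_data`, `window_size`, and `step` are toy values chosen so the result is easy to check by hand.

```r
# Hypothetical helper: number of positions where "c" is immediately followed by "g"
count_cg <- function(x) {
  n <- length(x)
  if (n < 2) return(0L)
  sum(x[-n] == "c" & x[-1] == "g")
}

# Toy sequence and small window settings, for illustration only
seq_data    <- c("a", "c", "g", "t", "c", "g", "g", "c", "c", "g", "a", "c")
window_size <- 6
step        <- 2

starts <- seq(1, length(seq_data) - window_size + 1, by = step)

# vapply returns one integer per window, so nothing is grown inside a loop
counted_cg <- vapply(
  starts,
  function(s) count_cg(seq_data[s:(s + window_size - 1)]),
  integer(1)
)
names(counted_cg) <- paste0(starts, "-", starts + window_size - 1)
counted_cg
```

Naming the result by window range keeps each count traceable to the positions it covers.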
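EDIT: To get the desired output with a window that slides by 10 observations, you can use a for loop. Because the result is preallocated, the loop is reasonably fast. I think this is the best way to solve the problem, since I don't think many grouping functions (if any) support sliding windows: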
library(data.table)
set.seed(1)
#Sample data
df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))
#The number of windows you want, shift by 10 each time
n_windows <- ((nrow(df) - 100) / 10) + 1
#Create empty DF, this helps increase speed of below loop
res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))
#Loop over each i, paste a lead-shifted copy of the window onto the window itself and count the "cg" pairs
for (i in 1:n_windows){
  res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
  subs <- as.character(df[((i-1)*10 + 1):((i-1)*10 + 100),"var"])
  #Keep the lead vector at full length: the last base pairs with NA, so paste0() has nothing to recycle
  subs2 <- paste0(subs, shift(subs, 1L, type="lead"))
  res$count_cg[i] <- sum(subs2=="cg")
}
head(res)
window count_cg
1 1-100 10
2 11-110 10
3 21-120 8
4 31-130 9
5 41-140 9
6 51-150 9
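For longer sequences, the per-window work can be avoided by marking every "cg" position once and taking differences of a running total. The following is a sketch of that alternative, not the method used in the loop above; the names `x`, `is_cg`, `cum`, and `res_vec` are assumptions made for this example.

```r
set.seed(1)
x <- sample(c("a", "g", "t", "c"), 600, replace = TRUE)  # sample data
window_size <- 100
step        <- 10

# is_cg[j] is TRUE when x[j] is "c" and x[j + 1] is "g"
is_cg <- x[-length(x)] == "c" & x[-1] == "g"

# cum[k + 1] holds the number of "cg" pairs starting at positions 1..k
cum <- c(0L, cumsum(is_cg))

starts <- seq(1, length(x) - window_size + 1, by = step)

# Pairs fully inside a window starting at s begin at positions s .. s + window_size - 2
count_cg <- cum[starts + window_size - 1] - cum[starts]

res_vec <- data.frame(
  window   = paste0(starts, "-", starts + window_size - 1),
  count_cg = count_cg
)
head(res_vec)
```

Only pairs that lie entirely within a window are counted, matching the per-window loop; the exact counts depend on the random sample and the R version's sampling algorithm.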