计算数据框中给定的基因组区域周围的出现次数

Question

我需要计算基因组中发生在特定位置或范围内的突变。突变具有基因组位置（染色体和碱基对，例如 Chr1、10658324）。范围或点分别定义为基因组中给定位置上游和下游 (+-) 的 10000 个碱基对。突变的位置和 "spots" 的位置都存储在数据框中。

示例：

set.seed(1)

Chr <- 1
Pos <- as.integer(runif(5000 , 0, 1e8))
mutations <- data.frame(Pos, Chr)

Chr <- 1
Pos <- as.integer(runif(50 , 0, 1e8))
spots <- data.frame(Pos, Chr)

所以我要问的问题是："spots" 中给出的位置周围 +-10k 碱基对存在多少突变。（例如，如果现货为 100k，则范围为 90k-110k）真实数据当然会包含所有 24 条染色体，但为了简单起见，我们现在可以只关注一条染色体。最终数据应包含 "spot" 及其附近的突变数量，最好是在数据框或矩阵中。

非常感谢您的任何建议或帮助！

这是第一次尝试，但我很确定有一种更优雅的方法。

w <- 10000  #setting range to 10k basepairs
loop <- spots$Pos  #creating vector of positions to loop through
out <- data.frame(0,0)
colnames(out) <- c("Pos", "Count")

for (l in loop) {
  temp <- nrow(filter(mutations, Pos>=l-w, Pos<=l+w))
  temp2 <- cbind(l,temp)
  colnames(temp2) <- c("Pos", "Count")
  out <- rbind(out, temp2)
}
 out <- out[-1,]

Answer 1

我不熟悉 R 中的床操作，所以我打算用 bedtools 提出一个答案，这里有人可以尝试转换为 GRanges 或其他 R 生物信息学库。

本质上，你有两个床文件，一个有你的斑点，另一个有你的突变（我假设后者每个都有 1bp 坐标）。在这种情况下，您将使用 closestBed 获取最近的点和每个突变的 bp 距离，然后过滤掉距离点 10KB 的点。 UNIX 环境中的代码如下所示：

# Assuming 4-column file structure (chr start end name)
closestBed -d -a mutations.bed -b spots.bed | awk ' <= 10000 {print}'

第 9 列 () 是距最近点的距离（以 bp 为单位）。根据您想要的具体程度，您可以查看 http://bedtools.readthedocs.io/en/latest/content/tools/closest.html 处的手册页。我很确定 R 中至少有一个类似 bedtools 的包。如果功能相似，您可以应用完全相同的解决方案。

希望对您有所帮助！

Answer 2

使用 data.table 重叠，然后聚合：

library(data.table)
#set the flank
myFlank <- 100000

#convert to ranges with flank
spotsRange <- data.table(
  chr = spots$Chr,
  start = spots$Pos - myFlank,
  end = spots$Pos + myFlank,
  posSpot = spots$Pos,
  key = c("chr", "start", "end"))

#convert to ranges start end same as pos
mutationsRange <- data.table(
  chr = mutations$Chr,
  start = mutations$Pos,
  end = mutations$Pos,
  key = c("chr", "start", "end"))

#merge by overlap
res <- foverlaps(mutationsRange, spotsRange, nomatch = 0)

#count mutations
resCnt <- data.frame(table(res$posSpot))
colnames(resCnt) <- c("Pos", "MutationCount")
merge(spots, resCnt, by = "Pos")
#         Pos Chr MutationCount
# 1   3439618   1            10
# 2   3549952   1            15
# 3   4375314   1            11
# 4   7337370   1            13
# ...

计算数据框中给定的基因组区域周围的出现次数

Count occurrences around a genomic region given in a data frame

r

bioinformatics