R: add column to dataframe if position falls within intervals
I have 2 files:
"query.tab"
grp pos
1 10
1 45
2 6
3 12
"data.tab"
grp start end info
1 1 15 blue
1 23 60 red
2 1 40 green
3 20 30 black
I am trying to add $info from "data.tab" to "query.tab", but only when $grp from "query" matches $grp from "data" and $pos from query.tab falls between $start and $end from data.tab.
The goal is to obtain:
grp pos info
1 10 blue
1 45 red
2 6 green
3 12 NA
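(N.B.: for a non-overlapping position, $info can be 'NA' or blank; it doesn't matter, since that shouldn't happen anyway.)
So far I have been using findOverlaps(), but I can't work out how to manipulate its output: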
library(IRanges)
query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data <- data.frame(grp = c(1, 1, 2, 3), start = c(1, 23, 1, 20), end = c(15, 60, 40, 30), info = c("blue", "red", "green", "black"))
query.ir <- IRanges(start = query$pos, end = query$pos, names = query$grp)
data.ir <- IRanges(start = data$start, end = data$end, names = data$grp)
o <- findOverlaps(query.ir, data.ir, type = "within")
o
Hits object with 7 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 3
[2] 1 1
[3] 2 2
[4] 3 3
[5] 3 1
[6] 4 3
[7] 4 1
-------
queryLength: 4 / subjectLength: 4
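Can I retrieve the $info field from this output, or am I on the wrong track?
Based on the desired output you provided, I think this works. It could be condensed, but I prefer this longer version to avoid confusion: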
# merge the two data frames: every query row is paired with every interval of its group
df <- merge(query.tab, data.tab, by = "grp")
# find the (grp, pos) pairs that have no overlapping interval:
#   1. keep pairs that appear only once in df (a single candidate interval)
#   2. of those, keep the ones whose pos lies outside [start, end]
#   3. set info to NA for them
key <- df[, c("grp", "pos")]
df.nas <- df[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]
df.nas <- df.nas[df.nas$pos > df.nas$end | df.nas$pos < df.nas$start, ]
df.nas$info <- NA
# keep only the overlapping rows (pos between start and end, inclusive)
df.cnd <- df[df$pos <= df$end & df$pos >= df$start, ]
# combine the overlapping and non-overlapping rows
df.mrg <- rbind(df.cnd, df.nas)
# drop the start and end columns and sort by group and position
df.final <- df.mrg[with(df.mrg, order(grp, pos)), c("grp", "pos", "info")]
# output:
df.final
# grp pos info
# 1 1 10 blue
# 4 1 45 red
# 5 2 6 green
# 6 3 12 <NA>
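One caveat with the version above: a position is only turned into an NA row when its group has a single interval. If a group has several intervals and the position falls in none of them, that row is dropped. A possible variant that keeps every query row is sketched below. It is only a sketch using base R: the names pairs, hits and result are made up for illustration, and the data is rebuilt inline so the snippet runs on its own.

```r
# Same example data as in the question, rebuilt inline.
query.tab <- read.table(text = "grp pos
1 10
1 45
2 6
3 12", header = TRUE)

data.tab <- read.table(text = "grp start end info
1 1 15 blue
1 23 60 red
2 1 40 green
3 20 30 black", header = TRUE)

# Pair every query row with every interval of the same group.
pairs <- merge(query.tab, data.tab, by = "grp")

# Keep only the pairs where pos lies inside [start, end] (inclusive).
hits <- pairs[pairs$pos >= pairs$start & pairs$pos <= pairs$end,
              c("grp", "pos", "info")]

# Left join back onto the query: all.x = TRUE keeps unmatched
# query rows and fills their info with NA.
result <- merge(query.tab, hits, by = c("grp", "pos"), all.x = TRUE)
print(result)
```

If a position falls inside more than one interval of its group, this returns one row per matching interval, so duplicates would have to be resolved separately.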
Data:
query.tab <- read.table(text = 'grp pos
1 10
1 45
2 6
3 12', header = TRUE, quote = '"')
data.tab <- read.table(text = 'grp start end info
1 1 15 blue
1 23 60 red
2 1 40 green
3 20 30 black', header = TRUE, quote = '"')
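As for the findOverlaps() output itself: the 7 hits appear because the ranges are compared on position only, and the names given to IRanges() do not restrict matches to the same grp. The Hits object is essentially a list of (query row, data row) index pairs, so one way to use it is to drop the pairs whose groups differ and then look up info by index. The sketch below reproduces those index pairs in base R so it runs without IRanges. The names inside, hits, qh, sh and keep are illustrative. With IRanges loaded, I would expect queryHits(o) and subjectHits(o) to supply the same two index vectors, though I have not run that variant here.

```r
query <- data.frame(grp = c(1, 1, 2, 3), pos = c(10, 45, 6, 12))
data <- data.frame(grp = c(1, 1, 2, 3),
                   start = c(1, 23, 1, 20),
                   end = c(15, 60, 40, 30),
                   info = c("blue", "red", "green", "black"),
                   stringsAsFactors = FALSE)

# Logical matrix: row i is query i, column j is interval j.
# TRUE where pos lies inside [start, end], ignoring grp.
inside <- outer(query$pos, data$start, ">=") &
  outer(query$pos, data$end, "<=")

# Index pairs of the position-only hits (7 of them, as in the Hits object).
hits <- which(inside, arr.ind = TRUE)
qh <- hits[, "row"]  # index into query
sh <- hits[, "col"]  # index into data

# Keep only the pairs whose groups agree.
keep <- query$grp[qh] == data$grp[sh]

# Look up info by index; query rows without a hit stay NA.
query$info <- NA_character_
query$info[qh[keep]] <- data$info[sh[keep]]
print(query)
```

If a query row had more than one same-group hit, the last assignment would win, so this assumes intervals within a group do not overlap.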