How to select rows of a file based on multiple conditions from another file in R?
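I have 2 genetic datasets. I am filtering file1 based on a column in file2. However, I also need to take a second column in file2 into account, and I am not sure how to do that.
Rows of file 1 should be extracted only if their chromosome position is more than 5000 above or more than 5000 below every chromosome position on the same chromosome in file 2.
For example, my data looks like this: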
File 1:
Variant Chromosome Chromosome Position
Variant1 2 14000
Variant2 1 9000
Variant3 8 37000
Variant4 1 21000
File 2:
Variant Chromosome Chromosome Position
Variant1 1 10000
Variant2 1 20000
Variant3 8 30000
Expected output (variants whose position is more than +/-5000 away from every row of file 2 on the same chromosome):
Variant Chromosome Position Chromosome
Variant1 14000 2
Variant3 37000 8
#Variant1 at 14000 is within 5000 of Variant1 at 10000 in file2, but it is on a different chromosome, so it is not compared and is kept.
#Variant3 at 37000 is on the same chromosome as Variant3 in file2 (at 30000) but is more than 5000 away, so it is kept.
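I tried doing this with Unix commands, but I could only get the greater-than-5000 +/- filtering per variant without taking the chromosome into account, and I was advised to try coding it in R. However, I am new to R and do not know where to start. I assume I need an if statement along the lines of "if line of file1 has matching chromosome number as file2, then perform the larger than 5000 +/- filtering within that chromosome number only", plus a for loop to go through each row. Even just guidance on how to learn to do this would be appreciated.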
Using your sample data and approach, I came up with this data.table solution.
A brief explanation is given in the code comments.
library(data.table)
#sample data
dt1 <- fread("Variant Chromosome Chromosome_Position
Variant1 2 14000
Variant2 1 9000
Variant3 8 37000
Variant4 1 21000")
dt2 <- fread("Variant Chromosome Chromosome_Position
Variant1 1 10000
Variant2 1 20000
Variant3 8 30000")
#create lower&upper boundaries for dt2 chromosome position
dt2[, c("low", "high") := .(Chromosome_Position - 5000, Chromosome_Position + 5000)]
#dt2 now looks like this:
#-------------------------------------------------------------
# Variant Chromosome Chromosome_Position low high
# 1: Variant1 1 10000 5000 15000
# 2: Variant2 1 20000 15000 25000
# 3: Variant3 8 30000 25000 35000
#find matches on chromosome, with position between low and high
# this is done using a non-equi join using the lower and upper boundaries
# created in dt2 in the previous step.
# on = .(...) means: Chromosome in dt1 and dt2 have to be the same, and
#   Chromosome_Position in dt1 has to be between
#   low and high of dt2.
#   You can (of course) use >= and <= if desired.
# match := i.Variant creates a new column in dt1, with the value of
#   Variant from dt2 (if a match is found).
#   If no match is found, the column gets an <NA>.
dt1[ dt2, match := i.Variant,
on = .(Chromosome, Chromosome_Position > low, Chromosome_Position < high ) ]
#dt1 now looks like this
#see the match-column for found dt1-matches in dt2
#-------------------------------------------------------------
# Variant Chromosome Chromosome_Position match
# 1: Variant1 2 14000 <NA>
# 2: Variant2 1 9000 Variant1
# 3: Variant3 8 37000 <NA>
# 4: Variant4 1 21000 Variant2
#keep only the rows without a match (i.e. is.na(match) == TRUE), and drop the match column,
# since we no longer need it.
dt1[ is.na(match) ][, match := NULL ][]
# Variant Chromosome Chromosome_Position
# 1: Variant1 2 14000
# 2: Variant3 8 37000
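For comparison, the same filter can be sketched in base R, which is closer to the "check each row against the matching chromosome" idea described in the question. This is only a minimal sketch under some assumptions: the sample data is retyped as plain data frames, the column name Chromosome_Position follows the fread() version above, and the names window and is_near are hypothetical helpers introduced for illustration. It uses the same strict comparison as the join above (a distance of exactly 5000 does not count as a match).

```r
# Sample data copied from the question, as plain data frames
file1 <- data.frame(
  Variant = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chromosome = c(2, 1, 8, 1),
  Chromosome_Position = c(14000, 9000, 37000, 21000),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  Variant = c("Variant1", "Variant2", "Variant3"),
  Chromosome = c(1, 1, 8),
  Chromosome_Position = c(10000, 20000, 30000),
  stringsAsFactors = FALSE
)

# Hypothetical name for the +/- distance threshold
window <- 5000

# TRUE if any file2 position on the same chromosome lies strictly
# within `window` of `pos`; FALSE if the chromosome is absent from file2
is_near <- function(chr, pos) {
  same_chr <- file2$Chromosome_Position[file2$Chromosome == chr]
  any(abs(same_chr - pos) < window)
}

# Apply the check to every row of file1, then keep the rows that are not near
near <- mapply(is_near, file1$Chromosome, file1$Chromosome_Position)
result <- file1[!near, ]
print(result)
```

On the sample data this keeps Variant1 and Variant3, the same rows as the data.table version. It compares each row of file1 against all same-chromosome rows of file2, so for large files the non-equi join is likely the better choice.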