If/else if: pick first matching record within set distance only after first condition is not met in R
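I want to select the closest previous owner within a set distance only after the first search condition is not met. The locations are called reflo (reference location), and they have corresponding x and y coordinates (called locx and locy, respectively).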
Conditions:
- if lifetime_census$reflo==owners$reflo.x[i], the condition is met
- if lifetime_census$reflo!=owners$reflo.x[i], find the next closest record (within 30 meters)
- if there is no record within 30 meters, assign NA
The previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:
id previous_id reflo locx locy lifespan
16161 5587 -310 -3 10 1810
16848 5101 Q1 17.3 0.8 55
21815 6077 M2 13 1.8 979
23938 6130 -49 -4 9 374
29615 7307 B.1 2.5 1 1130
Then I have an owners dataset (here is a sample):
squirrel_id spr_census reflo.x spring_locx spring_locy
6391 2005 M3 13 2.5
6130 2005 -310 -3 10
23586 2019 B9 2 9
To illustrate what I am trying to achieve:
squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner
6391 2004 M3 13 2.5 6077
6130 2005 -310 -3 10 5587
23586 2019 B9 2 9 NA
What I have tried so far is this:
n <- length(owners$squirrel_id)
distance <- 30 #This can be easily changed to bigger or smaller values
for(i in 1:n) {
  last_owner <- subset(lifetime_census,
    lifetime_census$reflo==owners$reflo.x[i] & #using the exact location
    ((30*owners$spring_locx[i]-30* lifetime_census$locx)^2+(30* owners$spring_locy[i]-30* lifetime_census$locy)^2<=(distance)^2)) #this sets the search limit
  owners[i,"previous_owner"] <- last_owner$previous_id[i]
}
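I don't know how to make the loop go through the conditions sequentially and then select a record within the search limit only if no exact match was found.
Any ideas?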
I would suggest something like this (assuming locx etc. are in the same units as distance):
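distance = 30
# Euclidean distance between (x1, y1) and (x2, y2); vectorised over x2, y2
distance_xy = function (x1, y1, x2, y2) {
  sqrt((x2 - x1)^2 + (y2 - y1)^2)
}
owners$previous_owner = NA  # create the result column before filling it row by row
for (i in seq_len(nrow(owners))) {
  if (owners$reflo.x[i] %in% lifetime_census$reflo) {
    # exact reflo match
    owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
  } else {
    # no exact match: distance from this owner to every census record
    dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)
    if (any(dt <= distance)) {
      # closest record within range
      owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
    } else {
      owners$previous_owner[i] = NA
    }
  }
}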
This gives:
squirrel_id spr_census reflo.x spring_locx spring_locy previous_owner
1 6391 2005 M3 13 2.5 6077
2 6130 2005 -310 -3 10.0 5587
3 23586 2019 B9 2 9.0 5587
Note that this will fail if there are multiple matches for reflo.
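One way to cope with several census records sharing a reflo is to reduce the exact matches to a single value before assigning. The sketch below is only an illustration: the helper name exact_owner, the duplicated "-310" record with previous_id 9999, and the rule of breaking ties by distance are all assumptions made here, not part of the question.

```r
# Hypothetical census in which two records share the reflo "-310".
census <- data.frame(
  previous_id = c(5587, 9999, 6077),
  reflo       = c("-310", "-310", "M2"),
  locx        = c(-3, -2, 13),
  locy        = c(10, 10, 1.8),
  stringsAsFactors = FALSE
)

# Return exactly one previous_id for an exact reflo match, or NA if none.
# Assumption: duplicates are resolved by taking the record closest to (x, y).
exact_owner <- function(reflo, x, y, census) {
  hits <- census[census$reflo == reflo, ]
  if (nrow(hits) == 0) return(NA_real_)
  d <- sqrt((hits$locx - x)^2 + (hits$locy - y)^2)
  hits$previous_id[which.min(d)]
}

exact_owner("-310", -3, 10, census)  # one value although two rows match
exact_owner("B9", 2, 9, census)      # NA, because no row has this reflo
```

Taking the first match or the most recent record would work just as well; which rule is right depends on the data.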
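[EDIT] Adding an alternative based on the comments below.
When you start adding conditions, if-else statements can get very messy. Here is another implementation that avoids the nested structure above: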
owners$previous_owner = NA  # create the result column before filling it row by row
for (i in seq_len(nrow(owners))) {
  # if we find the reflo
  if (owners$reflo.x[i] %in% lifetime_census$reflo) {
    owners$previous_owner[i] = lifetime_census[lifetime_census$reflo == owners$reflo.x[i], ]$previous_id
    next
  }
  # if we got here, then we didn't find the reflo, compute distances:
  dt = distance_xy(owners$spring_locx[i], owners$spring_locy[i], lifetime_census$locx, lifetime_census$locy)
  # if we find anyone within distance, get the closest one
  if (any(dt <= distance)) {
    owners$previous_owner[i] = lifetime_census[order(dt), ]$previous_id[1L]
    next
  }
  # if we got here, there was nobody within range, set NA and move on:
  owners$previous_owner[i] = NA
}
The code does exactly the same thing, but by making use of the for loop and next you can remove every else and the whole nested structure.
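The loop compares raw coordinates with distance, which is why the third squirrel is matched to 5587 rather than NA. The code in the question multiplies every coordinate by 30, which suggests that one coordinate unit corresponds to 30 m. Under that assumption, a rough sketch with the scaling made explicit could look like this (find_previous_owner and its units_to_m argument are names invented for this example):

```r
lifetime_census <- data.frame(
  previous_id = c(5587, 5101, 6077, 6130, 7307),
  reflo       = c("-310", "Q1", "M2", "-49", "B.1"),
  locx        = c(-3, 17.3, 13, -4, 2.5),
  locy        = c(10, 0.8, 1.8, 9, 1),
  stringsAsFactors = FALSE
)
owners <- data.frame(
  squirrel_id = c(6391, 6130, 23586),
  reflo.x     = c("M3", "-310", "B9"),
  spring_locx = c(13, -3, 2),
  spring_locy = c(2.5, 10, 9),
  stringsAsFactors = FALSE
)

# distance is in metres; units_to_m converts coordinate units to metres
# (assumed to be 30, following the 30* factor in the question's code).
find_previous_owner <- function(owners, census, distance = 30, units_to_m = 30) {
  out <- rep(NA_real_, nrow(owners))
  for (i in seq_len(nrow(owners))) {
    exact <- which(census$reflo == owners$reflo.x[i])
    if (length(exact) > 0) {
      out[i] <- census$previous_id[exact[1]]  # first exact match wins
      next
    }
    d <- units_to_m * sqrt((census$locx - owners$spring_locx[i])^2 +
                           (census$locy - owners$spring_locy[i])^2)
    if (any(d <= distance)) out[i] <- census$previous_id[which.min(d)]
  }
  out
}

owners$previous_owner <- find_previous_owner(owners, lifetime_census)
owners
```

With units_to_m = 30 the sample data should reproduce the result the question asks for (6077, 5587, NA); with units_to_m = 1 it falls back to the behaviour of the loop above.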
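Since you have two sets of criteria, I suggest you also split the task into two parts. Also, when combining two data frames, I always recommend finding a suitable join.
For the exact match, dplyr::inner_join will give you the correct rows.
For the next part, you can exclude the exact matches and use distance_left_join from the fuzzyjoin package to match the remaining rows. It also comes with an option for a maximum distance.
Then you can simply bind the two results: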
library(data.table)
lifetime_census <- fread('id previous_id reflo locx locy lifespan
16161 5587 -310 -3 10 1810
16848 5101 Q1 17.3 0.8 55
21815 6077 M2 13 1.8 979
23938 6130 -49 -4 9 374
29615 7307 B.1 3 1 1130')
lifetime_census
#> id previous_id reflo locx locy lifespan
#> 1: 16161 5587 -310 -3.0 10.0 1810
#> 2: 16848 5101 Q1 17.3 0.8 55
#> 3: 21815 6077 M2 13.0 1.8 979
#> 4: 23938 6130 -49 -4.0 9.0 374
#> 5: 29615 7307 B.1 3.0 1.0 1130
owners <- fread('squirrel_id spr_census reflo.x spring_locx spring_locy
6391 2005 M3 13 2.5
6130 2005 -310 -3 10
23586 2019 B9 2 9')
owners
#> squirrel_id spr_census reflo.x spring_locx spring_locy
#> 1: 6391 2005 M3 13 2.5
#> 2: 6130 2005 -310 -3 10.0
#> 3: 23586 2019 B9 2 9.0
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(fuzzyjoin)
# Search for exact match
df1 <- inner_join(owners, lifetime_census, by=c(reflo.x='reflo')) %>%
  select(squirrel_id:spring_locy, previous_id)
df1
#> squirrel_id spr_census reflo.x spring_locx spring_locy previous_id
#> 1 6130 2005 -310 -3 10 5587
df2 <-
  owners %>%
  anti_join(df1, by=c('squirrel_id')) %>% # Remove rows with exact matches
  distance_left_join(lifetime_census,
                     by=c(spring_locx='locx', spring_locy='locy'), # Match columns
                     max_dist=1, # Since you want a maximum distance of 30m = 1 unit
                     distance_col='dist') %>% # Optional, if you want to see the distance
  select(squirrel_id:spring_locy, previous_id, dist)
bind_rows(df1,df2)
#> squirrel_id spr_census reflo.x spring_locx spring_locy previous_id dist
#> 1 6130 2005 -310 -3 10.0 5587 NA
#> 2 6391 2005 M3 13 2.5 6077 0.7
#> 3 23586 2019 B9 2 9.0 NA NA
Created on 2020-03-02 by the reprex package (v0.3.0)
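For readers who would rather avoid the extra packages, the same two-step split can be sketched in base R. This is only an illustration of the idea, not what fuzzyjoin does internally. max_dist = 1 follows the comment above that one coordinate unit is 30 m, and the intermediate names (idx, nearest, in_range, fallback) are made up here.

```r
lifetime_census <- data.frame(
  previous_id = c(5587, 5101, 6077, 6130, 7307),
  reflo       = c("-310", "Q1", "M2", "-49", "B.1"),
  locx        = c(-3, 17.3, 13, -4, 3),
  locy        = c(10, 0.8, 1.8, 9, 1),
  stringsAsFactors = FALSE
)
owners <- data.frame(
  squirrel_id = c(6391, 6130, 23586),
  reflo.x     = c("M3", "-310", "B9"),
  spring_locx = c(13, -3, 2),
  spring_locy = c(2.5, 10, 9),
  stringsAsFactors = FALSE
)
max_dist <- 1  # in coordinate units (assumed: 1 unit = 30 m)

# Step 1: exact matches; match() returns NA where the reflo is not found.
idx  <- match(owners$reflo.x, lifetime_census$reflo)
prev <- lifetime_census$previous_id[idx]

# Step 2: pairwise distances (rows: owners, columns: census records).
dx <- outer(owners$spring_locx, lifetime_census$locx, "-")
dy <- outer(owners$spring_locy, lifetime_census$locy, "-")
d  <- sqrt(dx^2 + dy^2)

nearest  <- apply(d, 1, which.min)                            # closest record per owner
in_range <- d[cbind(seq_len(nrow(d)), nearest)] <= max_dist   # is it close enough?
fallback <- ifelse(in_range, lifetime_census$previous_id[nearest], NA)

# Use the distance fallback only where there was no exact match.
owners$previous_id <- ifelse(is.na(idx), fallback, prev)
owners
```

The full distance matrix has one cell per owner and census pair, so with more than 20,000 census records it may be worth processing the owners in chunks.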
The following solves the problem.
A function to compute the distance:
distance_xy = function (x1, y1, x2, y2) {
sqrt((x2 - x1)^2 + (y2 -y1)^2)
}
Determine the previous id within 30 meters. If all distances are greater than 30 meters, set the id to NA.
library(tidyverse)
previous_id_fn <- function(v, w, years){
  # distance from (v, w) to every census record
  dists <- map2_dbl(lifetime_census$locx, lifetime_census$locy, ~distance_xy(.x, .y, v, w))
  df <- data.frame(previous = lifetime_census$previous_id,
                   dist = dists,
                   life = lifetime_census$lifespan) %>%
    filter(life < years)  # keep only records with lifespan below years
  if (nrow(df) == 0) { return(NA) }  # no eligible record left after filtering
  id <- df$previous[[which.min(df$dist)]]
  if (min(df$dist, na.rm = TRUE) > 30) { id <- NA }
  return(id)
}
First join the data.frame owners with the data.frame lifetime_census to get the previous_id column. Then apply the function defined above to each row of the data.frame.
owners %>%
  left_join(., lifetime_census, by = c("reflo.x" = "reflo")) %>%
  select(squirrel_id:spring_locy, previous_id) %>%
  rowwise() %>%
  mutate(previous_id = ifelse(is.na(previous_id),
                              previous_id_fn(spring_locx, spring_locy, 1000),
                              previous_id))
Edit:
I added an argument years to the function previous_id_fn(). If lifespan > years, the function now returns NA.
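To see what the years filter does, here is a small base R sketch of the same idea run on the sample data. The function name nearest_previous_id and its census and max_dist arguments are invented for this illustration, and distances are compared in raw coordinate units, as in the function above.

```r
lifetime_census <- data.frame(
  previous_id = c(5587, 5101, 6077, 6130, 7307),
  locx        = c(-3, 17.3, 13, -4, 2.5),
  locy        = c(10, 0.8, 1.8, 9, 1),
  lifespan    = c(1810, 55, 979, 374, 1130)
)

# Closest previous_id among records with lifespan < years, or NA if no
# eligible record lies within max_dist of (x, y).
nearest_previous_id <- function(x, y, census, years, max_dist = 30) {
  cand <- census[census$lifespan < years, ]
  if (nrow(cand) == 0) return(NA_real_)      # the filter removed every record
  d <- sqrt((cand$locx - x)^2 + (cand$locy - y)^2)
  if (min(d) > max_dist) return(NA_real_)    # nothing close enough
  cand$previous_id[which.min(d)]
}

nearest_previous_id(2, 9, lifetime_census, years = 2000)  # every record is eligible
nearest_previous_id(2, 9, lifetime_census, years = 1000)  # lifespans 1810 and 1130 are excluded
nearest_previous_id(2, 9, lifetime_census, years = 50)    # no record is eligible
```

On this sample the three calls should return 5587, 6130 and NA respectively: the filter changes which record counts as the nearest, not just whether one is found.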