基于地理空间距离的插补缺失环境数据
Imputation missing environmental data based on geospatial distance
我想通过替换两个最近站点的 Temp 平均值来估算站点 Temp 的缺失值。
library(tidyverse)
library(lubridate)
tb1 <-
tibble::tibble(
Date = as_date(rep(c("2019-01-01", "2019-01-02"), each = 4))
, Stat = rep(c("F", "L", "M", "R"), times = 2)
, Lat = rep(c(31.418715, 31.582045, 30.181459, 33.626057), times = 2)
, Long = rep(c(73.079109, 74.329376, 71.492157, 73.071442), times = 2)
, Temp = c(NA, 20, 28, 25, 26, 25, NA, 24)
)
tb1
# A tibble: 8 x 5
Date Stat Lat Long Temp
<date> <chr> <dbl> <dbl> <dbl>
1 2019-01-01 F 31.4 73.1 NA
2 2019-01-01 L 31.6 74.3 20
3 2019-01-01 M 30.2 71.5 28
4 2019-01-01 R 33.6 73.1 25
5 2019-01-02 F 31.4 73.1 26
6 2019-01-02 L 31.6 74.3 25
7 2019-01-02 M 30.2 71.5 NA
8 2019-01-02 R 33.6 73.1 24
这将用所有站的非缺失温度的平均值替换缺失值。
impute.mean <- function(x) {
replace(x, is.na(x), mean(x, na.rm = TRUE))
}
tb1 %>%
group_by(Date) %>%
mutate(Temp1 = impute.mean(Temp))
# A tibble: 8 x 6
# Groups: Date [2]
Date Stat Lat Long Temp Temp1
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F 31.4 73.1 NA 24.3
2 2019-01-01 L 31.6 74.3 20 20
3 2019-01-01 M 30.2 71.5 28 28
4 2019-01-01 R 33.6 73.1 25 25
5 2019-01-02 F 31.4 73.1 26 26
6 2019-01-02 L 31.6 74.3 25 25
7 2019-01-02 M 30.2 71.5 NA 25
8 2019-01-02 R 33.6 73.1 24 24
使用此代码查找两个站点之间的距离
library(geosphere)
distm(
x = c(73.079109, 31.418715)
, y = c(74.329376, 31.582045)
, fun = distHaversine
)
[,1]
[1,] 120053.3
不知道如何使用 tidyverse 计算距离?
tb1 %>%
mutate(
Dist = distm(
x = c(Long, Lat)
, y = c(Long, Lat)
, fun = distHaversine
)
)
Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2
我在下面添加了一个使用 spatialrisk 包的解决方案。这个包中的关键函数是用 C++ (Rcpp) 编写的,因此速度非常快。
tb1 中没有观测温度的站点:
tb1_na <- tb1 %>% filter(is.na(Temp))
创建函数以确定特定日期到站点的距离:
circle_fn <- function(x, y, z){
spatialrisk::points_in_circle(tb1 %>% filter(Date == z),
lon_center = x,
lat_center = y,
lon = Long,
lat = Lat,
radius = 1e6)
}
由于输出的每个元素都是一个数据框,purrr::map_dfr 用于将它们行绑定在一起:
purrr::pmap_dfr(list(tb1_na$Long, tb1_na$Lat, tb1_na$Date),
circle_fn, .id = "tb1_na") %>%
group_by(tb1_na) %>%
slice(2:3) %>%
summarize(Temp = mean(Temp)) %>%
ungroup() %>%
bind_cols(tb1_na, .) %>%
select(-tb1_na)
输出:
Date Stat Lat Long Temp Temp1
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F 31.4 73.1 NA 24
2 2019-01-02 M 30.2 71.5 NA 25.5
我想通过替换两个最近站点的 Temp 平均值来估算站点 Temp 的缺失值。
library(tidyverse)
library(lubridate)
tb1 <-
tibble::tibble(
Date = as_date(rep(c("2019-01-01", "2019-01-02"), each = 4))
, Stat = rep(c("F", "L", "M", "R"), times = 2)
, Lat = rep(c(31.418715, 31.582045, 30.181459, 33.626057), times = 2)
, Long = rep(c(73.079109, 74.329376, 71.492157, 73.071442), times = 2)
, Temp = c(NA, 20, 28, 25, 26, 25, NA, 24)
)
tb1
# A tibble: 8 x 5
Date Stat Lat Long Temp
<date> <chr> <dbl> <dbl> <dbl>
1 2019-01-01 F 31.4 73.1 NA
2 2019-01-01 L 31.6 74.3 20
3 2019-01-01 M 30.2 71.5 28
4 2019-01-01 R 33.6 73.1 25
5 2019-01-02 F 31.4 73.1 26
6 2019-01-02 L 31.6 74.3 25
7 2019-01-02 M 30.2 71.5 NA
8 2019-01-02 R 33.6 73.1 24
这将用所有站的非缺失温度的平均值替换缺失值。
impute.mean <- function(x) {
replace(x, is.na(x), mean(x, na.rm = TRUE))
}
tb1 %>%
group_by(Date) %>%
mutate(Temp1 = impute.mean(Temp))
# A tibble: 8 x 6
# Groups: Date [2]
Date Stat Lat Long Temp Temp1
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F 31.4 73.1 NA 24.3
2 2019-01-01 L 31.6 74.3 20 20
3 2019-01-01 M 30.2 71.5 28 28
4 2019-01-01 R 33.6 73.1 25 25
5 2019-01-02 F 31.4 73.1 26 26
6 2019-01-02 L 31.6 74.3 25 25
7 2019-01-02 M 30.2 71.5 NA 25
8 2019-01-02 R 33.6 73.1 24 24
使用此代码查找两个站点之间的距离
library(geosphere)
distm(
x = c(73.079109, 31.418715)
, y = c(74.329376, 31.582045)
, fun = distHaversine
)
[,1]
[1,] 120053.3
不知道如何使用 tidyverse 计算距离?
tb1 %>%
mutate(
Dist = distm(
x = c(Long, Lat)
, y = c(Long, Lat)
, fun = distHaversine
)
)
Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2
我在下面添加了一个使用 spatialrisk 包的解决方案。这个包中的关键函数是用 C++ (Rcpp) 编写的,因此速度非常快。
tb1 中没有观测温度的站点:
tb1_na <- tb1 %>% filter(is.na(Temp))
创建函数以确定特定日期到站点的距离:
circle_fn <- function(x, y, z){
spatialrisk::points_in_circle(tb1 %>% filter(Date == z),
lon_center = x,
lat_center = y,
lon = Long,
lat = Lat,
radius = 1e6)
}
由于输出的每个元素都是一个数据框,purrr::map_dfr 用于将它们行绑定在一起:
purrr::pmap_dfr(list(tb1_na$Long, tb1_na$Lat, tb1_na$Date),
circle_fn, .id = "tb1_na") %>%
group_by(tb1_na) %>%
slice(2:3) %>%
summarize(Temp = mean(Temp)) %>%
ungroup() %>%
bind_cols(tb1_na, .) %>%
select(-tb1_na)
输出:
Date Stat Lat Long Temp Temp1
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F 31.4 73.1 NA 24
2 2019-01-02 M 30.2 71.5 NA 25.5