基于地理空间距离的插补缺失环境数据

Imputation missing environmental data based on geospatial distance

我想通过替换两个最近站点的 Temp 平均值来估算站点 Temp 的缺失值。

library(tidyverse)
library(lubridate)

tb1 <- 
  tibble::tibble(
    Date = as_date(rep(c("2019-01-01", "2019-01-02"), each = 4))
  , Stat = rep(c("F", "L", "M", "R"), times = 2)
  , Lat  = rep(c(31.418715, 31.582045, 30.181459, 33.626057), times = 2)
  , Long = rep(c(73.079109, 74.329376, 71.492157, 73.071442), times = 2)
  , Temp = c(NA, 20, 28, 25, 26, 25, NA, 24)
  )

tb1

# A tibble: 8 x 5
  Date       Stat    Lat  Long  Temp
  <date>     <chr> <dbl> <dbl> <dbl>
1 2019-01-01 F      31.4  73.1    NA
2 2019-01-01 L      31.6  74.3    20
3 2019-01-01 M      30.2  71.5    28
4 2019-01-01 R      33.6  73.1    25
5 2019-01-02 F      31.4  73.1    26
6 2019-01-02 L      31.6  74.3    25
7 2019-01-02 M      30.2  71.5    NA
8 2019-01-02 R      33.6  73.1    24

这将用所有站的非缺失温度的平均值替换缺失值。

impute.mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
  }

tb1 %>% 
  group_by(Date) %>% 
  mutate(Temp1 = impute.mean(Temp))

# A tibble: 8 x 6
# Groups:   Date [2]
  Date       Stat    Lat  Long  Temp Temp1
  <date>     <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F      31.4  73.1    NA  24.3
2 2019-01-01 L      31.6  74.3    20  20  
3 2019-01-01 M      30.2  71.5    28  28  
4 2019-01-01 R      33.6  73.1    25  25  
5 2019-01-02 F      31.4  73.1    26  26  
6 2019-01-02 L      31.6  74.3    25  25  
7 2019-01-02 M      30.2  71.5    NA  25  
8 2019-01-02 R      33.6  73.1    24  24

使用此代码查找两个站点之间的距离

library(geosphere)

distm(
    x = c(73.079109, 31.418715)
  , y = c(74.329376, 31.582045)
  , fun = distHaversine
  )

         [,1]
[1,] 120053.3

不知道如何使用 tidyverse 计算距离?

tb1 %>% 
  mutate(
    Dist = distm(
          x = c(Long, Lat)
        , y = c(Long, Lat)
        , fun = distHaversine
        )
    )

Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2

我在下面添加了一个使用 spatialrisk 包的解决方案。这个包中的关键函数是用 C++ (Rcpp) 编写的,因此速度非常快。

tb1 中没有观测温度的站点:

tb1_na <- tb1 %>% filter(is.na(Temp)) 

创建函数以确定特定日期到站点的距离:

circle_fn <- function(x, y, z){
  spatialrisk::points_in_circle(tb1 %>% filter(Date == z), 
                                lon_center = x, 
                                lat_center = y, 
                                lon = Long, 
                                lat = Lat, 
                                radius = 1e6)
}

由于输出的每个元素都是一个数据框,purrr::map_dfr 用于将它们行绑定在一起:

purrr::pmap_dfr(list(tb1_na$Long, tb1_na$Lat, tb1_na$Date), 
                circle_fn, .id = "tb1_na") %>%
   group_by(tb1_na) %>%
   slice(2:3) %>%
   summarize(Temp = mean(Temp)) %>%
   ungroup() %>%
   bind_cols(tb1_na, .) %>%
   select(-tb1_na)

输出:

  Date       Stat    Lat  Long  Temp Temp1
  <date>     <chr> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 F      31.4  73.1    NA  24  
2 2019-01-02 M      30.2  71.5    NA  25.5