Assign rows to a group based on spatial neighborhood and temporal criteria in R

Question

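I have a problem I can't seem to solve. I have a data set derived from a raster in ArcGIS. The data set represents every fire occurrence over a 10-year period. Some raster cells had multiple fires within that period (and so have multiple rows in my data set), and some raster cells had no fires at all (and so are not represented in my data set). Each row in the data set therefore has a row number and a column number (consecutive integers) that correspond to the row and column IDs in the raster. It also has the date of the fire.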

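I would like to assign a unique ID (fire_ID) to all fires that are within 4 days of each other and in adjacent pixels (within the 8-cell neighborhood), and put this ID into a new column.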

To clarify: if there is an observation from row 3, column 3 on 1 January 2000 and another from row 2, column 4 on 4 January 2000, these observations would be assigned the same fire_ID.
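The rule can be written as a pairwise test. The following is only a sketch: the function name `is_same_fire` and its parameters are hypothetical and not part of the original post, and it assumes "within 4 days" means a difference of at most 4 days.

```r
# Hypothetical helper (not from the original post): TRUE when two records are
# in the same 8-cell neighborhood and at most `max_days` days apart.
is_same_fire <- function(row1, col1, date1, row2, col2, date2,
                         max_cells = 1, max_days = 4) {
  abs(row1 - row2) <= max_cells &
    abs(col1 - col2) <= max_cells &
    abs(as.numeric(as.Date(date1) - as.Date(date2))) <= max_days
}

# The example from the text: row 3, column 3 on 1 Jan 2000
# versus row 2, column 4 on 4 Jan 2000
linked <- is_same_fire(3, 3, "2000-01-01", 2, 4, "2000-01-04")
linked
```

Note that this only tests one pair at a time. Fires that are linked through a chain of such pairs still need to end up with the same fire_ID, which is the hard part of the question.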

Below is a sample data set, where "rows" is the row ID of the raster, "cols" is the column ID of the raster, and "dates" is the date the fire was detected.

rows<-sample(seq(1,50,1),600, replace=TRUE)
cols<-sample(seq(1,50,1),600, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),600, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)

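I tried sorting the data by "row", "column" and "date" and looping through it, reusing the previous record's fire_ID whenever the row and column IDs were within one value and the date was within 4 days. This obviously doesn't work: fires that should get the same fire_ID are assigned different fire_IDs whenever observations belonging to a different fire_ID sit between them in the sorted list.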

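# Sort by row, column and date, then compare each record with the previous one only
fire_df2 <- fire_df[order(fire_df$rows, fire_df$cols, fire_df$dates), ]
fire_ID <- numeric(length = nrow(fire_df2))
fire_ID[1] <- 1
for (i in 2:nrow(fire_df2)) {
  fire_ID[i] <- ifelse(
    # abs() must wrap the differences, not the thresholds
    abs(fire_df2$rows[i] - fire_df2$rows[i - 1]) <= 1 &
      abs(fire_df2$cols[i] - fire_df2$cols[i - 1]) <= 1 &
      abs(as.numeric(fire_df2$dates[i] - fire_df2$dates[i - 1])) <= 4,
    fire_ID[i - 1],  # same fire as the previous record
    i                # otherwise start a new ID
  )
}
length(unique(fire_ID))
fire_df2$fire_ID <- fire_ID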
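To make the failure concrete, here is a tiny made-up data set (three hypothetical records named `toy`, not from the original data) run through the same consecutive-comparison idea. Records A and C are neighbours one day apart, but an unrelated record B sorts between them.

```r
# Three made-up records: A = (1,1) on 1 Jan, B = (1,2) on 20 Jan, C = (2,1) on 2 Jan
toy <- data.frame(
  rows  = c(1, 1, 2),
  cols  = c(1, 2, 1),
  dates = as.Date(c("2000-01-01", "2000-01-20", "2000-01-02"))
)
toy <- toy[order(toy$rows, toy$cols, toy$dates), ]  # B ends up between A and C

toy_ID <- numeric(nrow(toy))
toy_ID[1] <- 1
for (i in 2:nrow(toy)) {
  # Compare only with the immediately preceding record
  same <- abs(toy$rows[i] - toy$rows[i - 1]) <= 1 &&
    abs(toy$cols[i] - toy$cols[i - 1]) <= 1 &&
    abs(as.numeric(toy$dates[i] - toy$dates[i - 1])) <= 4
  toy_ID[i] <- if (same) toy_ID[i - 1] else i
}
toy_ID  # each record gets its own ID (1, 2, 3), although A and C should share one
```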

Please let me know if you have any suggestions.

Answer 1

I think this task calls for something along the lines of hierarchical clustering.

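Note, however, that there is necessarily some degree of arbitrariness in the IDs. It is entirely possible for a cluster of fires to span more than 4 days overall, while every fire in it is within 4 days of some other fire in the same cluster (and should therefore get the same ID).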
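This chaining effect can be illustrated with a small made-up example that applies the same `dist`/`hclust`/`cutree` steps used in this answer's code. The data frame `chain` is hypothetical: one pixel burning on four dates, each 3 days after the previous one.

```r
# Made-up example: the same pixel observed on days 0, 3, 6 and 9
chain <- data.frame(
  rows  = rep(10, 4),
  cols  = rep(10, 4),
  dates = as.Date("2000-01-01") + c(0, 3, 6, 9)
)

# Scale dates so that 4 days equal 1 unit, then use the maximum norm
chain_dist  <- dist(cbind(chain$rows, chain$cols, as.numeric(chain$dates) / 4),
                    method = "maximum")
chain_group <- cutree(hclust(chain_dist, method = "single"), h = 1)

chain_group                           # group label for each record
as.numeric(diff(range(chain$dates)))  # total span of the records in days
```

All four records land in one group even though the first and last are 9 days apart, because each consecutive pair is only 3 days apart.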

library(dplyr)

# Create the distances
fire_dist <- fire_df %>%
  # Normalize dates
  mutate( norm_dates = as.numeric(dates)/4) %>% 
  # Only keep the three variables of interest
  select( rows, cols, norm_dates ) %>%
  # Compute pairwise distances using the L-infinity (maximum) norm
  dist( method="maximum" )

# Do hierarchical clustering with "single" aggl method
fire_clust <- hclust(fire_dist, method="single")

# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h=1)

# First attach the group ids back to the data frame
fire_df2 <- cbind( fire_df, group_id ) %>%
  # Then sort the data
  arrange( group_id, dates, rows, cols ) 

# Print the first 10 records
fire_df2[1:10,]
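Why dividing the dates by 4 and cutting at height 1 matches the question's criteria: with the maximum norm, the distance between two records is max(|row difference|, |column difference|, |day difference| / 4), which is at most 1 exactly when all three conditions hold. The sketch below checks this on made-up random data; the seed, the sample size and all variable names are assumptions chosen for illustration only.

```r
set.seed(42)  # arbitrary seed, for reproducibility of this sketch only
n <- 200
rows  <- sample(1:50, n, replace = TRUE)
cols  <- sample(1:50, n, replace = TRUE)
dates <- sample(seq(as.Date("2000-01-01"), as.Date("2000-02-01"), by = "day"),
                n, replace = TRUE)

# Pairwise criteria written out explicitly
row_ok  <- abs(outer(rows, rows, "-")) <= 1
col_ok  <- abs(outer(cols, cols, "-")) <= 1
date_ok <- abs(outer(as.numeric(dates), as.numeric(dates), "-")) <= 4
linked_explicit <- row_ok & col_ok & date_ok

# The same links expressed through the scaled maximum-norm distance
d <- dist(cbind(rows, cols, as.numeric(dates) / 4), method = "maximum")
linked_dist <- unname(as.matrix(d)) <= 1

# Both constructions should describe the same set of linked pairs
agree <- all(linked_explicit == linked_dist)
agree
```

Single-linkage clustering cut at height 1 then merges every set of records connected through such links. Keep in mind that `dist` stores all pairwise distances, so memory use grows quadratically with the number of rows.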

(Make sure you have the dplyr library installed. If not, you can run install.packages("dplyr", dependencies=TRUE). It is a very nice and very popular library for data manipulation.)

A few simple tests:

Test #1. A single forest fire that moves.

rows<-1:6
cols<-1:6
dates<-seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
fire_df<-data.frame(rows, cols, dates)

This gives me:

  rows cols      dates group_id
1    1    1 2000-01-01        1
2    2    2 2000-01-02        1
3    3    3 2000-01-03        1
4    4    4 2000-01-04        1
5    5    5 2000-01-05        1
6    6    6 2000-01-06        1

Test #2. 6 different random forest fires.

set.seed(1234)

rows<-sample(seq(1,50,1),6, replace=TRUE)
cols<-sample(seq(1,50,1),6, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),6, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)

Output:

  rows cols      dates group_id
1    6    1 2000-01-10        1
2   32   12 2000-01-30        2
3   31   34 2000-01-10        3
4   32   26 2000-01-27        4
5   44   35 2000-01-10        5
6   33   28 2000-01-09        6

Test #3. One expanding forest fire.

dates <- seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
rows_start <- 50
cols_start <- 50

fire_df <- data.frame(dates = dates) %>%
    rowwise() %>%
    do({
      diff = as.numeric(.$dates - as.Date("2000/01/01"))
      expand.grid(rows=seq(rows_start-diff,rows_start+diff), 
                  cols=seq(cols_start-diff,cols_start+diff),
                  dates=.$dates) 
    })

This gives me:

   rows cols      dates group_id
1    50   50 2000-01-01        1
2    49   49 2000-01-02        1
3    49   50 2000-01-02        1
4    49   51 2000-01-02        1
5    50   49 2000-01-02        1
6    50   50 2000-01-02        1
7    50   51 2000-01-02        1
8    51   49 2000-01-02        1
9    51   50 2000-01-02        1
10   51   51 2000-01-02        1

And so on. (All records are correctly identified as belonging to the same forest fire.)
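Rather than reading the printout, the result for the expanding-fire case can also be checked programmatically. The following is a sketch in base R only; the names `grow`, `grow_dist` and `grow_group` are made up for this illustration, and the data are rebuilt so the snippet runs on its own.

```r
# Rebuild the expanding fire: on day d the burning area is a (2d + 1) x (2d + 1)
# square centred on cell (50, 50)
start_day <- as.Date("2000-01-01")
grow <- do.call(rbind, lapply(0:5, function(d) {
  expand.grid(rows  = seq(50 - d, 50 + d),
              cols  = seq(50 - d, 50 + d),
              dates = start_day + d)
}))

# Same clustering recipe as above
grow_dist  <- dist(cbind(grow$rows, grow$cols, as.numeric(grow$dates) / 4),
                   method = "maximum")
grow_group <- cutree(hclust(grow_dist, method = "single"), h = 1)

nrow(grow)                  # number of burning-cell records
length(unique(grow_group))  # number of distinct fires found
```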
