Rasterize polygons in R using snowfall & sfLapply
I want to rasterize a very large vector file to 25 m and have had some success with the 'cluster' package, adapting the questions here and , which worked nicely for that particular dataset.
However, I now have an even larger vector file that needs rasterizing, and I have access to a cluster that uses snowfall. I'm not used to cluster functions and I'm just not sure how to set up sfLapply. I keep getting the following kind of error when sfLapply is called on the cluster:
Error in checkForRemoteErrors(val) :
one node produced an error: 'quote(96)' is not a function, character or symbol
Calls: sfLapply ... clusterApply -> staticClusterApply -> checkForRemoteErrors
My full code:
library(snowfall)
library(rgeos)
library(maptools)
library(raster)
library(sp)
setwd("/home/dir/")
# Initialise the cluster...
hosts = as.character(read.table(Sys.getenv('PBS_NODEFILE'),header=FALSE)[,1]) # read the nodes to use
sfSetMaxCPUs(length(hosts)) # make sure the maximum allowed number of CPUs matches the number of hosts
sfInit(parallel=TRUE, type="SOCK", socketHosts=hosts, cpus=length(hosts), useRscript=TRUE) # initialise a socket cluster session with the named nodes
sfLibrary(snowfall)
# read in required data
shp <- readShapePoly("my_data.shp")
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
crs(shp) <- BNG
### rasterize the uniques to 25m and write (GB and clipped) ###
rw <- raster(res=c(25,25), xmn=0, xmx=600000, ymn=0, ymx=1000000, crs=BNG)
# Number of polygon features in the SPDF
features <- 1:nrow(shp[,])
# Split features in n parts
n <- 96
parts <- split(features, cut(features, n))
rasFunction = function(X, shape, raster, nparts){
ras = rasterize(shape[nparts[[X]],], raster, 'CODE')
return(ras)
}
# Export everything in the workspace onto the cluster...
sfExportAll()
# Distribute calculation across the cluster nodes...
rDis = sfLapply(n, fun=rasFunction,X=n, shape=shp, raster=rw, nparts=parts) # parallel equivalent of lapply
rMerge <- do.call(merge, rDis)
writeRaster(rMerge, filename="my_data_25m", format="GTiff", overwrite=TRUE)
# Stop the cluster...
sfStop()
I've tried lots of things, changing the function and the sfLapply call, but I just can't get it to run. Thanks.
Because I can't do formatting in a comment:
library(maptools)
shp <- readShapePoly("my_data.shp")
BNG <- "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +datum=OSGB36 +units=m +no_defs"
shp.2 <- spTransform(shp, CRS(BNG))
#Continue as before
Overriding the projection != reprojecting the data.
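As for the sfLapply error itself: sfLapply(x, fun, ...) hands each element of its first argument to fun, so passing the count n as x while also binding X=n by name appears to be why the value 96 ends up being treated as the function to apply. A minimal sketch of the call I would expect to work, reusing rasFunction exactly as defined in the question:
# pass the part indices 1:n as the first argument; sfLapply supplies
# each index to rasFunction's X, so X must not also be bound by name
rDis <- sfLapply(1:n, fun=rasFunction, shape=shp, raster=rw, nparts=parts)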
OK, so I abandoned snowfall and looked into gdalUtils::gdal_rasterize instead, and found a lot of benefits to using it (with one drawback that someone might be able to answer?).
Context & issue: My vector data live inside an ESRI File Geodatabase and need some pre-rasterization processing. Not a problem; rgdal::readOGR is fine. However, gdal_rasterize requires a pathname to the vector data, and I ran into trouble here because I could not write my processed vector data out: they exceed the maximum file size for a shapefile outside a geodatabase, and gdal_rasterize will not accept objects, or paths to .gdbs or .Rdata/.rds files. How do I pass an object to gdal_rasterize??
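(An aside I have not benchmarked: since gdal_rasterize takes any OGR-readable path, a single GeoPackage might sidestep the shapefile size limit entirely. A hedged sketch, assuming a GDAL build with the GPKG driver; processed.gpkg and out.tif are hypothetical names, and out.tif is a pre-written blank raster as in the code further down:)
# hypothetical: write the processed object once to a GeoPackage (no
# shapefile-style size limit), then point gdal_rasterize at that path
writeOGR(shape, dsn="processed.gpkg", layer="mydata", driver="GPKG")
gdal_rasterize(src_datasource="processed.gpkg", dst_filename="out.tif",
               b=1, a="code", l="mydata")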
In the end, I wrote the large shapefile out in segments equal to the number of processors.
raster::rasterize was used originally because I could simply pass the vector object held in memory to rasterize without the writing problem (although I would have liked to write it out), rasterizing this data to 25 m. It took a very long time, even in parallel.
Solution: gdal_rasterize in parallel.
# gdal_rasterize in parallel
require(gdalUtils)
require(rgdal)
require(rgeos)
require(cluster)
require(parallel)
require(raster)
# read in vector data
shape <- readOGR("./mygdb.gdb", layer="mydata",stringsAsFactors=F)
## do all the vector processing etc ##
# split vector data into n parts, the same as number of processors (minus 1)
npar <- detectCores() - 1
features <- 1:nrow(shape[,])
parts <- split(features, cut(features, npar))
# write the vector parts out
for(n in 1:npar){
writeOGR(shape[parts[[n]],], "./parts", paste0("mydata_p",n), driver="ESRI Shapefile")
}
# set up and write a blank raster for gdal_rasterize for EACH vector segment created above
r <- raster(res=c(25,25), xmn=234000, xmx=261000, ymn=229000, ymx=256000, crs=projection(shape))
for(n in 1:npar){
writeRaster(r, filename=paste0("./gdal_p",n,".tif"), format="GTiff", overwrite=TRUE)
}
# set up cluster and pass required packages and objects to cluster
cl <- makeCluster(npar)
clusterEvalQ(cl, sapply(c('raster', 'gdalUtils',"rgdal"), require, char=TRUE))
clusterExport(cl, list("r","npar"))
# parallel apply the gdal_rasterize function against the vector parts that were written,
# same number as processors, against the pre-prepared rasters
# note: the per-worker index must be x, not the leftover loop variable n,
# otherwise every worker targets the same output file
parLapply(cl = cl, X = 1:npar, fun = function(x) gdal_rasterize(src_datasource=paste0("./parts/mydata_p",x,".shp"),
dst_filename=paste0("./gdal_p",x,".tif"),b=1,a="code",verbose=F,output_Raster=T))
# There are now n rasters representing the n segments of the original vector file
# read in the rasters as a list, merge and write to a new tif.
s <- lapply(X=1:npar, function(x) raster(paste0("./gdal_p",x,".tif")))
s$filename <- "myras_final.tif"
do.call(merge,s)
stopCluster(cl)
With this code, the whole job (60% of it vector reading/processing/writing and 40% raster generation and rasterization) ran roughly 9x faster than parallel raster::rasterize.
Note: I originally tried this by splitting the vector into n parts but creating only 1 blank raster, then writing to that same blank raster from all cluster nodes simultaneously. That corrupted the raster and made it unusable in R/Arc/anything (despite the function completing without errors). The above is the more stable approach, but n blank rasters have to be made instead of 1, which increases processing time, plus merging the n rasters is extra processing.
Caveat: the parallel raster::rasterize run did not have writeRaster inside the rasterize function; it was a separate line, which will have added to the original run's processing time through storage to the temp directory and the like.
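For a fairer comparison, the write can live inside the worker. A minimal sketch of that variant of the question's rasFunction (rasWriteFunction and the part_ file names are hypothetical):
rasWriteFunction <- function(X, shape, raster, nparts){
  ras <- rasterize(shape[nparts[[X]],], raster, 'CODE')
  # write each part straight from the worker instead of returning
  # temp-backed rasters and writing them in a second pass
  writeRaster(ras, filename=paste0("part_",X,".tif"), format="GTiff", overwrite=TRUE)
  return(ras)
}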
EDIT: Why do the raster frequency tables from gdal_rasterize and raster::rasterize differ? With 100 million cells I'd expect some difference, but for some codes they differed by several thousand cells. I thought they both rasterized by cell centroid?
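(To quantify that, raster::freq tabulates cell values, so the two outputs can be compared directly; a sketch with hypothetical file names for the two versions:)
# join the two frequency tables on cell value to see per-code differences
f_gdal <- as.data.frame(freq(raster("gdal_version.tif")))
f_rast <- as.data.frame(freq(raster("raster_version.tif")))
merge(f_gdal, f_rast, by="value", suffixes=c(".gdal",".raster"))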