Extracting Details from Google Earth KML File in R
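I am trying to get details for a series of locations from a Google Earth KML file.
Getting the ID and coordinates works, but the location name sits in the first table cell (td tag) of the description, and when I extract it for all locations, every row comes back with the same value (Stratford Road, the name of the first location).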
library(sf)
library(tidyverse)
library(rvest)

removeHtmlTags <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags))[1]
}

download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1", "aqms.kml")

locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = getHtmlTableCells(Description)[1]) %>%
  st_drop_geometry()
If I call the function on one specific location and take the first table cell (td), it works: the two calls below return Stratford Road and Selly Oak respectively.
getHtmlTableCells(locations$Description[1])[1]
getHtmlTableCells(locations$Description[2])[1]
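What am I doing wrong?
read_html is not vectorised: it does not accept a vector of different HTML strings to parse. We can apply your function to each element of the vector: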
locations <- st_read("aqms.kml", stringsAsFactors = FALSE)

# Column names are kept from the question. Note that st_coordinates()
# returns X (longitude) in column 1 and Y (latitude) in column 2.
locations %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = sapply(Description, function(x) getHtmlTableCells(x)[1])) %>%
  st_drop_geometry()
#> latitude longitiude name
#> 1 -1.871622 52.45920 Stratford Road
#> 2 -1.934559 52.44513 Selly Oak (Bristol Road)
#> 3 -1.830070 52.43771 Acocks Green
#> 4 -1.898731 52.48180 Colmore Row
#> 5 -1.896764 52.48607 St Chads Queensway
#> 6 -1.891955 52.47990 Moor Street Queensway
#> 7 -1.918173 52.48138 Birmingham Ladywood
#> 8 -1.902121 52.47675 Lower Severn Street
#> 9 -1.786413 52.56815 New Hall
#> 10 -1.874989 52.47609 Birmingham A4540 Roadside
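The symptom in the question (every row showing the first location's name) can be reproduced without any KML at all. The following is a minimal base R sketch, not the code from the question: first_word is a hypothetical stand-in for any function that only looks at the first element of its input. Called on a whole column it returns a single value, which mutate() then recycles to every row; applied element by element it returns one value per row.

```r
# Hypothetical stand-in for a function that only handles one string:
# strsplit() returns a list, and [[1]] silently keeps only the first element.
first_word <- function(s) strsplit(s, " ")[[1]][1]

x <- c("alpha one", "beta two", "gamma three")

# Called on the whole vector: a single value comes back.
whole_vector <- first_word(x)

# Applied per element: one value per input string.
per_element <- vapply(x, first_word, character(1), USE.NAMES = FALSE)

print(whole_vector)
print(per_element)
```

vapply is used here instead of sapply only because it checks that each call returns exactly one string; sapply works the same way for this case.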
Alternatively, since you are already using regex in your function, you could use stringr::str_extract to extract the text; it is already vectorised.
library(sf)
library(tidyverse)

locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = str_extract(Description, '(?<=Location</td> <td>)[^<]+')) %>%
  st_drop_geometry()
Here (?<=Location</td> <td>) is a lookbehind for the Location td tag that comes immediately before the name, and [^<]+ matches everything up to the next tag after the name.
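To see the lookbehind at work in isolation, here is a small base R sketch. The two description strings are made up for illustration and only assume the "Location</td> <td>" layout that the pattern above relies on; the real KML descriptions may contain more markup. regexpr with perl = TRUE is used as a dependency-free stand-in for str_extract.

```r
# Hypothetical description strings, shaped like the layout the pattern expects.
desc <- c(
  "<table><tr><td>Location</td> <td>Stratford Road</td></tr></table>",
  "<table><tr><td>Location</td> <td>Selly Oak (Bristol Road)</td></tr></table>"
)

# Lookbehind: match must be preceded by the Location cell; then take
# every character up to (not including) the next "<".
pattern <- "(?<=Location</td> <td>)[^<]+"

# regexpr() is vectorised over desc; perl = TRUE enables lookbehind.
m <- regexpr(pattern, desc, perl = TRUE)
site_names <- regmatches(desc, m)

print(site_names)
```

One difference from str_extract: regmatches drops elements with no match instead of returning NA, so the result can be shorter than the input when some descriptions do not follow the layout.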
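Your getHtmlTableCells function is not vectorised. If you pass it a single HTML string it works fine, but if you pass it several strings it only processes the first. Also, you have put the [1] after the return statement, where it does nothing; it needs to be inside the parentheses. Once you do that, it is easy to use sapply.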
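The point about the misplaced [1] is easy to check on its own. In this base R sketch the two functions are hypothetical and differ only in where the subscript sits: return() exits the function as soon as it is evaluated, so a subscript written after its closing parenthesis is never applied.

```r
# Subscript outside return(): return() exits first, so [1] never runs.
f_outside <- function(x) {
  return(toupper(x))[1]
}

# Subscript inside return(): only the first element is returned.
f_inside <- function(x) {
  return(toupper(x)[1])
}

cells <- c("a", "b", "c")

res_outside <- f_outside(cells)  # all three elements come back
res_inside  <- f_inside(cells)   # only the first element comes back

print(res_outside)
print(res_inside)
```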
So make one tiny change to your function...
getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags)[1])
}
and vectorise it like this:
download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1", "aqms.kml")

locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(latitude = st_coordinates(geometry)[,1],
         longitiude = st_coordinates(geometry)[,2],
         name = sapply(as.list(Description), getHtmlTableCells)) %>%
  st_drop_geometry()
Which gives the correct result:
locations$name
#> [1] "Stratford Road" "Selly Oak (Bristol Road)"
#> [3] "Acocks Green" "Colmore Row"
#> [5] "St Chads Queensway" "Moor Street Queensway"
#> [7] "Birmingham Ladywood" "Lower Severn Street"
#> [9] "New Hall" "Birmingham A4540 Roadside"