Extracting Details from Google Earth KML File in R

Question

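I am trying to get details for a series of locations from a Google Earth KML file.

Getting the ID and coordinates works, but for the location name (which is in the first table cell, i.e. td tag, of the description), when I do this for all locations it returns the same value for all of them (Stratford Road, the name of the first location).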

library(sf)
library(tidyverse)
library(rvest)

removeHtmlTags <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
getHtmlTableCells<- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags))[1]
}

download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(longitude = st_coordinates(geometry)[,1],  # X column is longitude
         latitude = st_coordinates(geometry)[,2],   # Y column is latitude
         name = getHtmlTableCells(Description)[1]) %>%
  st_drop_geometry()

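Now, if I use the function on a specific location and take the first table cell (td), it works, returning Stratford Road for the first location and Selly Oak for the second, as shown below.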

getHtmlTableCells(locations$Description[1])[1]
getHtmlTableCells(locations$Description[2])[1]

What am I doing wrong?

Answer 1

read_html is not vectorized: it does not accept a vector of different html strings to parse. We can apply your function to each element of the vector instead:

locations <- st_read("aqms.kml", stringsAsFactors = FALSE)

locations %>%
  rename(id = Name) %>%
  # st_coordinates() returns X (longitude) first, then Y (latitude)
  mutate(longitude = st_coordinates(geometry)[,1],
         latitude = st_coordinates(geometry)[,2],
         # call the function once per description, not once on the whole column
         name = sapply(Description, function(x) getHtmlTableCells(x)[1])) %>%
  st_drop_geometry()

#>    longitude latitude                      name
#> 1  -1.871622 52.45920            Stratford Road
#> 2  -1.934559 52.44513  Selly Oak (Bristol Road)
#> 3  -1.830070 52.43771              Acocks Green
#> 4  -1.898731 52.48180               Colmore Row
#> 5  -1.896764 52.48607        St Chads Queensway
#> 6  -1.891955 52.47990     Moor Street Queensway
#> 7  -1.918173 52.48138       Birmingham Ladywood
#> 8  -1.902121 52.47675       Lower Severn Street
#> 9  -1.786413 52.56815                  New Hall
#> 10 -1.874989 52.47609 Birmingham A4540 Roadside
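
To see why a single call gives the same name in every row, here is a minimal base R sketch. `first_cell` is a hypothetical stand-in for a parser that, like the behaviour described above, only looks at the first string it is given. The pipe-delimited strings are made-up sample data, not the real KML descriptions.

```r
# Hypothetical stand-in for a non-vectorized parser: it only reads x[1]
first_cell <- function(x) {
  strsplit(x[1], "|", fixed = TRUE)[[1]][1]
}

# Made-up sample data (not the real KML descriptions)
descriptions <- c("Stratford Road|roadside",
                  "Selly Oak|urban",
                  "Acocks Green|urban")

# One call on the whole vector: a single value, which mutate() would
# then recycle to every row
whole_vector <- first_cell(descriptions)

# One call per element: one value per description
per_element <- vapply(descriptions, first_cell, character(1), USE.NAMES = FALSE)

print(whole_vector)
print(per_element)
```

`vapply` is used here instead of `sapply` only because it checks that each call returns exactly one string. For this kind of data, `sapply` as used in the answer should behave the same way.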

Alternatively, since you are already using a regex in your function, you can use stringr::str_extract, which is already vectorized, to extract the text.

library(sf)
library(tidyverse)

locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  mutate(longitude = st_coordinates(geometry)[,1],  # X column is longitude
         latitude = st_coordinates(geometry)[,2],   # Y column is latitude
         name = str_extract(Description, '(?<=Location</td> <td>)[^<]+')) %>%
  st_drop_geometry()

where (?<=Location</td> <td>) is a lookbehind for the Location td tag that precedes our name, and [^<]+ matches everything up to the next tag after the name.
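
The lookbehind pattern can be tried on its own with a small sketch. It swaps in base R's `regexpr(..., perl = TRUE)` and `regmatches` for `str_extract`, so that it runs without stringr. The two HTML fragments are made up to fit the pattern and are an assumption; the real Description markup may differ.

```r
# Made-up fragments shaped to fit the pattern (assumption, not the real KML content)
descriptions <- c(
  "<table><tr><td>Location</td> <td>Stratford Road</td></tr></table>",
  "<table><tr><td>Location</td> <td>Selly Oak (Bristol Road)</td></tr></table>"
)

# (?<=...) requires the text before the match to be "Location</td> <td>",
# [^<]+ then consumes characters until the next "<"
pattern <- "(?<=Location</td> <td>)[^<]+"

# regexpr() is vectorized over the input strings; perl = TRUE enables lookbehind
matches <- regexpr(pattern, descriptions, perl = TRUE)
site_names <- regmatches(descriptions, matches)

print(site_names)
```

Note that `regmatches` silently drops elements with no match, whereas `str_extract` returns NA for them, so `str_extract` is the safer choice inside `mutate()`, where the result must keep one value per row.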

Answer 2

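Your getHtmlTableCells function is not vectorized. If you pass it a single html string it works fine, but if you pass it multiple strings it will only process the first one. Also, you placed the [1] after the return statement, where it does nothing. It needs to be inside the parentheses. Once you do that, it is easy to use sapply.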
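
The effect of putting the [1] outside return() can be checked with a small base R sketch. Both function names are hypothetical and `toupper` just stands in for the tag-stripping step.

```r
# [1] written after return(): the function exits as soon as return() is
# evaluated, so the subsetting is never applied
first_wrong <- function(x) {
  return(toupper(x))[1]
}

# [1] written inside return(): only the first element is returned
first_right <- function(x) {
  return(toupper(x)[1])
}

cells <- c("a", "b", "c")
wrong_result <- first_wrong(cells)  # all three elements come back
right_result <- first_right(cells)  # only the first element comes back

print(wrong_result)
print(right_result)
```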

Vectorizing the function
So make one tiny change to your function...

getHtmlTableCells <- function(htmlString) {
  # Convert html to html doc
  htmldoc <- read_html(htmlString)
  # get html for each cell (i.e. within <td></td>)
  table_cells_with_tags <- html_nodes(htmldoc, "td")
  # remove the html tags (<td></td>)
  return(removeHtmlTags(table_cells_with_tags)[1])
}

and vectorize it like this:

download.file("https://www.dropbox.com/s/ohipb477kqrqtlz/AQMS_2019.kml?dl=1","aqms.kml")
locations <- st_read("aqms.kml", stringsAsFactors = FALSE) %>%
  rename(id = Name) %>%
  # st_coordinates() returns X (longitude) first, then Y (latitude)
  mutate(longitude = st_coordinates(geometry)[,1],
         latitude = st_coordinates(geometry)[,2],
         name = sapply(as.list(Description), getHtmlTableCells)) %>%
  st_drop_geometry()

Which gives the correct result:

locations$name
#> [1] "Stratford Road"            "Selly Oak (Bristol Road)"
#> [3] "Acocks Green"              "Colmore Row"
#> [5] "St Chads Queensway"        "Moor Street Queensway"
#> [7] "Birmingham Ladywood"       "Lower Severn Street"
#> [9] "New Hall"                  "Birmingham A4540 Roadside"
