Read an HTML table from Dropbox with the XML package

Question

I am trying to read an HTML table stored in Dropbox using the XML package, but XML::readHTMLTable does not work on the HTML file from Dropbox and I don't know why. Can anyone help me?

My code:

Packages

require(httr)
require(XML)

Open the HTML table file in Dropbox

FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0")

Read the table

tables <- getNodeSet(htmlParse(FILE), "//table") 
FE_tab <- readHTMLTable(tables[2],
                        header = c("empresa","desc_projeto","desc_regiao",
                                   "cadastrador_por","cod_talhao","descricao",
                                   "formiga_area","qtd_destruido","latitude",
                                   "longitude","data_cadastro"),
                        colClasses = c("character","character","character",
                                       "character","character","character",
                                       "character","character","character",
                                       "character","character"),
                        trim = TRUE, stringsAsFactors = FALSE
)
head(FE_tab) ### Doesn’t work

Answer 1

You can do it as follows:

require(rvest)
doc <- read_html("https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
FE_tab <- doc %>% html_table() %>% `[[`(1)

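In your code, you need to use ?dl=1 at the end of the URL. Otherwise, if you open https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0, you get the source code of the Dropbox page that displays the file, not the file itself.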
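To avoid editing the link by hand, the switch from ?dl=0 to ?dl=1 can be done in code. The following is only a minimal sketch in base R; the function name dropbox_direct_url is a hypothetical helper invented for this illustration, not part of httr, XML, or rvest, and it assumes the link carries the dl flag the way the URL in the question does.

```r
# Hypothetical helper (not from any package): force dl=1 on a Dropbox share link
# so that the file itself is returned instead of the Dropbox preview page.
dropbox_direct_url <- function(url) {
  if (grepl("[?&]dl=[01]", url)) {
    # A dl flag is already present: rewrite its value to 1
    sub("([?&])dl=[01]", "\\1dl=1", url)
  } else {
    # No dl flag yet: append one, using "&" if the URL already has a query string
    sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
    paste0(url, sep, "dl=1")
  }
}

share_url  <- "https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0"
direct_url <- dropbox_direct_url(share_url)
print(direct_url)
```

The resulting direct_url can then be passed to read_html() or GET() in place of the hard-coded address.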

If you still want to use the XML package, do the following:

FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
tables <- getNodeSet(htmlParse(FILE), "//table") 
FE_tab <- readHTMLTable(tables[[1]], 
                        header = c("empresa","desc_projeto","desc_regiao", 
                                   "cadastrador_por","cod_talhao","descricao", 
                                   "formiga_area","qtd_destruido","latitude", 
                                   "longitude","data_cadastro"), 
                        colClasses = c("character","character","character", 
                                       "character","character","character", 
                                       "character","character","character", 
                                       "character","character"), 
                        trim = TRUE, stringsAsFactors = FALSE 
) 
head(FE_tab)

Because tables is a list, use tables[[1]] rather than tables[1], and use index 1 instead of 2 because the list contains only one element.
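The difference between single and double brackets can be checked offline on a small inline document. This is only an illustrative sketch: the HTML string below is made up for the example (it borrows two column names from the question) and is not the content of the Dropbox file.

```r
library(XML)

# Made-up HTML with a single table, standing in for the downloaded file
html <- "<html><body>
<table>
  <tr><th>cod_talhao</th><th>latitude</th></tr>
  <tr><td>A1</td><td>-20.5</td></tr>
  <tr><td>B2</td><td>-21.0</td></tr>
</table>
</body></html>"

doc    <- htmlParse(html, asText = TRUE)
tables <- getNodeSet(doc, "//table")

length(tables)       # one <table> node in this snippet, so index 2 does not exist
class(tables[1])     # single brackets: still a list-like container of nodes
class(tables[[1]])   # double brackets: the table node itself

# readHTMLTable needs the node, so extract it with [[ ]]
tab <- readHTMLTable(tables[[1]], stringsAsFactors = FALSE)
str(tab)
```

Checking length(tables) before indexing is a cheap way to see how many tables the parsed page actually contains.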
