Read an HTML table in Dropbox with the XML package

I am trying to read an HTML table stored in Dropbox using the XML package, but XML::readHTMLTable does not work on the HTML file in Dropbox and I don't know why. Can anyone help me?

My code:

Packages

require(httr)
require(XML) 

Open the HTML table file in Dropbox

FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0") 

Read the table

tables <- getNodeSet(htmlParse(FILE), "//table") 
FE_tab <- readHTMLTable(tables[2], 
                        header = c("empresa","desc_projeto","desc_regiao", 
                                   "cadastrador_por","cod_talhao","descricao", 
                                   "formiga_area","qtd_destruido","latitude", 
                                   "longitude","data_cadastro"), 
                        colClasses = c("character","character","character", 
                                       "character","character","character", 
                                       "character","character","character", 
                                       "character","character"), 
                        trim = TRUE, stringsAsFactors = FALSE 
) 
head(FE_tab) ### Doesn’t work

You can do it as follows:

require(rvest)
doc <- read_html("https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
FE_tab <- doc %>% html_table() %>% `[[`(1)

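In your code, you need to use ?dl=1 at the end of the URL. Otherwise, if you open https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0, you get the source code of the Dropbox page that is displayed in the browser rather than the file itself.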
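If you handle several shared links, the ?dl=0 to ?dl=1 switch can be wrapped in a small helper. This is only a sketch in base R; the function name dropbox_direct_url is made up for illustration and is not part of httr, XML, or rvest. It assumes the link either already carries a dl parameter or has none at all.

```r
# Hypothetical helper (not part of any package): force dl=1 on a Dropbox share link.
dropbox_direct_url <- function(url) {
  if (grepl("[?&]dl=[01]", url)) {
    # A dl parameter is already present: set its value to 1.
    sub("([?&])dl=[01]", "\\1dl=1", url)
  } else {
    # No dl parameter: append one with the right separator.
    sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
    paste0(url, sep, "dl=1")
  }
}

share_url  <- "https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0"
direct_url <- dropbox_direct_url(share_url)
print(direct_url)
```

The resulting direct_url can then be passed to GET() or read_html() in place of the original link.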

If you still want to use the XML package, do the following:

FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
tables <- getNodeSet(htmlParse(FILE), "//table") 
FE_tab <- readHTMLTable(tables[[1]], 
                        header = c("empresa","desc_projeto","desc_regiao", 
                                   "cadastrador_por","cod_talhao","descricao", 
                                   "formiga_area","qtd_destruido","latitude", 
                                   "longitude","data_cadastro"), 
                        colClasses = c("character","character","character", 
                                       "character","character","character", 
                                       "character","character","character", 
                                       "character","character"), 
                        trim = TRUE, stringsAsFactors = FALSE 
) 
head(FE_tab)

Because tables is a list, use tables[[1]] to extract the element itself. Use index 1 rather than 2 because tables contains only one list element.
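The difference between single and double brackets can be seen without touching the network. The sketch below uses a plain base R list holding one data frame as a stand-in for the result of getNodeSet(); the real object holds XML nodes, but as far as list indexing goes it should behave the same way.

```r
# Stand-in for the getNodeSet() result: a list with exactly one element.
tables <- list(data.frame(x = 1:2))

n_tables     <- length(tables)       # how many tables were found
single_class <- class(tables[1])     # single brackets return a sub-list
double_class <- class(tables[[1]])   # double brackets return the element itself

# Indexing past the end with single brackets does not fail:
# it returns a list whose only element is NULL.
past_end_is_null <- is.null(tables[2][[1]])

# Double brackets past the end raise an error instead.
past_end_fails <- tryCatch({ tables[[2]]; FALSE }, error = function(e) TRUE)

print(n_tables)
print(single_class)
print(double_class)
```

Checking length(tables) before indexing is a cheap way to confirm how many tables the page actually contains.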