Read an HTML table in Dropbox with the XML package
I am trying to read an HTML table stored in Dropbox using the XML package, but XML::readHTMLTable does not work on the HTML file in Dropbox and I don't know why. Can anyone help me?
My code:
# Packages
require(httr)
require(XML)
# Open the HTML table file in Dropbox
FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0")
# Read the table
tables <- getNodeSet(htmlParse(FILE), "//table")
FE_tab <- readHTMLTable(tables[2],
                        header = c("empresa","desc_projeto","desc_regiao",
                                   "cadastrador_por","cod_talhao","descricao",
                                   "formiga_area","qtd_destruido","latitude",
                                   "longitude","data_cadastro"),
                        colClasses = c("character","character","character",
                                       "character","character","character",
                                       "character","character","character",
                                       "character","character"),
                        trim = TRUE, stringsAsFactors = FALSE
)
head(FE_tab) ### Doesn’t work
You can do it as follows:
require(rvest)
doc <- read_html("https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
FE_tab <- doc %>% html_table() %>% `[[`(1)
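In your code, you need to use ?dl=1 at the end of the URL. Otherwise, if you request https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0, you get the source code of the Dropbox page that displays the file rather than the file itself.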
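As a small illustration of the URL change described above, the share link can be rewritten in base R before it is passed to GET() or read_html(). This is only a sketch: the variable names share_url and direct_url are made up for the example, and it assumes the link ends in dl=0 as it does in the question.

```r
# Share link as copied from Dropbox (ends in ?dl=0, which returns the preview page)
share_url <- "https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0"

# Swap the trailing dl=0 for dl=1 so that the file itself is requested
direct_url <- sub("dl=0$", "dl=1", share_url)

print(direct_url)
```

The resulting direct_url can then be used in place of the hard-coded address in either approach.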
If you still want to use the XML package, do the following:
FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
tables <- getNodeSet(htmlParse(FILE), "//table")
FE_tab <- readHTMLTable(tables[[1]],
                        header = c("empresa","desc_projeto","desc_regiao",
                                   "cadastrador_por","cod_talhao","descricao",
                                   "formiga_area","qtd_destruido","latitude",
                                   "longitude","data_cadastro"),
                        colClasses = c("character","character","character",
                                       "character","character","character",
                                       "character","character","character",
                                       "character","character"),
                        trim = TRUE, stringsAsFactors = FALSE
)
head(FE_tab)
Because tables is a list, extract the table node with tables[[1]] rather than tables[1]. Use 1 instead of 2 because the list contains only one element.
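The indexing point can be checked without any download. The sketch below parses a tiny inline HTML string, which is invented for this example and is not the Dropbox file, and assumes the XML package is installed. It shows that getNodeSet() returns a list-like node set and that [[1]] pulls out the single table node that readHTMLTable() expects.

```r
library(XML)

# Hypothetical stand-in for the downloaded file: one table with a header row
html <- paste0(
  "<html><body><table>",
  "<tr><th>empresa</th><th>cod_talhao</th></tr>",
  "<tr><td>A</td><td>001</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html, asText = TRUE)
tables <- getNodeSet(doc, "//table")

# Only one <table> was found, so index 2 does not refer to a table
n_tables <- length(tables)

# Double brackets extract the node itself; single brackets keep a list
node <- tables[[1]]
still_a_list <- is.list(tables[1])

tab <- readHTMLTable(node, stringsAsFactors = FALSE)
print(tab)
```

Checking length(tables) before indexing is a cheap way to confirm how many tables the page actually contains.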