Keeping the structure of a table with multiple elements in one cell when scraping in R

Question

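Some web pages contain a table in which a single cell holds multiple elements. I can scrape the contents of the table with the code below, but I cannot bind the elements together in the structure they have on the page. Is there a way to combine these elements cleanly, or should we use a different approach to get each element?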

library(XML)

dataissued <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
ec_parsed <- htmlTreeParse(dataissued, encoding = "UTF-8", useInternalNodes = TRUE)

# gather the table contents and build the data frame
# title and introduction link of each IR resource
item_title <- xpathSApply(ec_parsed, '//td[@headers="t1"]//a', xmlValue)
item_hrefs <- xpathSApply(ec_parsed, '//td[@headers="t1"]//a/@href')
# author and introduction link of each IR resource
auth_name <- xpathSApply(ec_parsed, '//td[@headers="t2"]//a', xmlValue)
auth_hrefs <- xpathSApply(ec_parsed, '//td[@headers="t2"]//a/@href')
# publication date of each IR resource
pub_date <- xpathSApply(ec_parsed, '//td[@headers="t3"]', xmlValue)
# full-text link of each IR resource
con_link <- xpathSApply(ec_parsed, '//td[@headers="t3"]//a[@href]', xmlValue)

# if the vectors differ in length (several authors per paper), cbind() recycles the shorter ones
item_table <- cbind(item_title, item_hrefs, auth_name, auth_hrefs, pub_date, con_link)
# six columns, so six names
colnames(item_table) <- c("t1", "href1", "t2", "href2", "t3", "t4")

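I have tried many times and still cannot arrange them properly. A paper may have several authors, and all of the authors and their links should be kept in one "row". Instead, each author ends up on a separate row and the paper titles are recycled, which makes the result a mess.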
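
The misalignment described above can be reproduced without the web page. The following is a minimal sketch on made-up data (the paper and author names are hypothetical, not taken from the site): when a whole-column vector of titles is shorter than the vector of authors, cbind() silently reuses titles from the start.

```r
# Hypothetical data: two papers, the first with two authors, the second with one.
item_title <- c("Paper A", "Paper B")
auth_name  <- c("Author 1", "Author 2", "Author 3")

# cbind() recycles the shorter vector (normally with a warning), so the third
# row pairs "Author 3" with "Paper A" although that author wrote "Paper B".
misaligned <- suppressWarnings(cbind(item_title, auth_name))
print(misaligned)
```

This is why extracting each column for the whole page at once loses the link between a paper and its authors; the extraction has to happen row by row.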

Answer 1

Here is one way to build a long data frame from the table:

library(rvest)
library(purrr)
library(tibble)

pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued")

# extract the columns
col1 <- html_nodes(pg, "td[headers='t1']")
col2 <- html_nodes(pg, "td[headers='t2']")
col3 <- html_nodes(pg, "td[headers='t3']")

# this is the way to get the full text column
col4 <- html_nodes(pg, "td[headers='t3'] + td")

# now, iterate over the rows; map_df() will bind all our data frames together
map_df(seq_along(col1), function(i) {

  # extract the links in each cell of row i
  a1 <- html_nodes(col1[i], "a")
  a2 <- html_nodes(col2[i], "a")
  a4 <- html_nodes(col4[i], "a")

  # put the row into a long data frame: one line per author,
  # with the length-one columns (title, date) recycled
  tibble(      title = html_text(a1, trim = TRUE),
          title_link = html_attr(a1, "href"),
              author = html_text(a2, trim = TRUE),
         author_link = html_attr(a2, "href"),
          issue_date = html_text(col3[i], trim = TRUE),
           full_text = html_attr(a4, "href"))

})
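
If one row per paper is preferred over the long format, the authors of each row can be collapsed into a single string. The following is a rough sketch, not part of the original answer: it runs on a small inline HTML fragment that only imitates the `td[headers=...]` layout (the markup, names and links are assumptions), and the "; " separator is an arbitrary choice.

```r
library(rvest)

# Hypothetical markup that only mimics the td[headers=...] layout of the page.
html <- '<table>
  <tr>
    <td headers="t1"><a href="/p/1">Paper A</a></td>
    <td headers="t2"><a href="/a/1">Author 1</a>; <a href="/a/2">Author 2</a></td>
    <td headers="t3">2015</td>
  </tr>
  <tr>
    <td headers="t1"><a href="/p/2">Paper B</a></td>
    <td headers="t2"><a href="/a/3">Author 3</a></td>
    <td headers="t3">2016</td>
  </tr>
</table>'
pg <- read_html(html)

# Work row by row so that every value stays attached to its own paper.
rows <- html_nodes(pg, "tr")
papers <- do.call(rbind, lapply(seq_along(rows), function(i) {
  row      <- rows[[i]]
  title_a  <- html_nodes(row, "td[headers='t1'] a")
  author_a <- html_nodes(row, "td[headers='t2'] a")
  date_td  <- html_nodes(row, "td[headers='t3']")
  data.frame(
    title        = paste(html_text(title_a, trim = TRUE), collapse = "; "),
    title_link   = paste(html_attr(title_a, "href"), collapse = "; "),
    authors      = paste(html_text(author_a, trim = TRUE), collapse = "; "),
    author_links = paste(html_attr(author_a, "href"), collapse = "; "),
    issue_date   = paste(html_text(date_td, trim = TRUE), collapse = "; "),
    stringsAsFactors = FALSE
  )
}))
print(papers)
```

Collapsing keeps the table rectangular at the cost of having to split the strings again later; the long format above is usually easier to filter and join.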

Answer 2

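The biggest problem I had with the "rvest" package is garbled text. Even when the "encoding" argument is passed, the result is still garbled, although the page is encoded in UTF-8. For example: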

library(rvest)
pg <- read_html("http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued", encoding = "UTF-8")

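In my tests, the "XML" package performed best: when I use the getNodeSet function, the result is correct, with no garbled text at all. However, I only get whole nodes and cannot tie each row of the table to its structure.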

library(XML)
pg <- "http://www.irgrid.ac.cn/handle/1471x/294320/browse?type=dateissued"
pg_tables <- getNodeSet(htmlParse(pg), "//table[@summary='This table browse all dspace content']")
# gather the title cells of the whole table
# (a leading "." keeps the search inside the given node; a bare "//" searches the whole document)
papernode <- getNodeSet(pg_tables[[1]], ".//td[@headers='t1']")
paper_hrefs <- xpathSApply(papernode[[1]], './/a/@href')
paper_name <- xpathSApply(papernode[[1]], './/a', xmlValue)
# gather the author cells in the table
authnode <- getNodeSet(pg_tables[[1]], ".//td[@headers='t2']")
# gather the date cells in the table
datenode <- getNodeSet(pg_tables[[1]], ".//td[@headers='t3']")

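With this program I can get these "nodes" separately. However, extracting each header and its link seems to get harder, because the class of the result of "getNodeSet" differs from that of "html_nodes". How can we automatically read the node set produced by "getNodeSet" and accurately extract each header and its link from these nodes?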
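
One possible way to do this with the "XML" package is to select the table rows first and then run relative XPath queries (starting with ".") on each row node. The following is a sketch under assumptions: the inline HTML only imitates the page layout, and the column names and the "; " separator are illustrative choices rather than anything given by the site.

```r
library(XML)

# Hypothetical markup that only mimics the td[@headers=...] layout of the page.
html <- '<table>
  <tr>
    <td headers="t1"><a href="/p/1">Paper A</a></td>
    <td headers="t2"><a href="/a/1">Author 1</a>; <a href="/a/2">Author 2</a></td>
    <td headers="t3">2015</td>
  </tr>
  <tr>
    <td headers="t1"><a href="/p/2">Paper B</a></td>
    <td headers="t2"><a href="/a/3">Author 3</a></td>
    <td headers="t3">2016</td>
  </tr>
</table>'
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Select only the rows that contain a title cell.
rows <- getNodeSet(doc, "//tr[td[@headers='t1']]")

# Helper (hypothetical name): run a relative XPath on one row and collapse the hits.
collapse_xpath <- function(node, path, ...) {
  paste(unlist(xpathSApply(node, path, ...)), collapse = "; ")
}

papers <- do.call(rbind, lapply(rows, function(row) {
  data.frame(
    title        = collapse_xpath(row, "./td[@headers='t1']//a", xmlValue),
    title_link   = collapse_xpath(row, "./td[@headers='t1']//a/@href"),
    authors      = collapse_xpath(row, "./td[@headers='t2']//a", xmlValue),
    author_links = collapse_xpath(row, "./td[@headers='t2']//a/@href"),
    issue_date   = trimws(collapse_xpath(row, "./td[@headers='t3']", xmlValue)),
    stringsAsFactors = FALSE
  )
}))
print(papers)
```

Because every query is evaluated relative to a single row, the authors and links cannot drift to another paper, which is the same idea as iterating over rows with "html_nodes".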
