Scraping a complex HTML table into a data.frame in R
I am trying to load Wikipedia's data on US Supreme Court Justices into R:
library(rvest)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
[1] "Wilson, JamesJames Wilson" "Jay, JohnJohn Jay†"
[3] "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
[5] "Rutledge, JohnJohn Rutledge" "Iredell, JamesJames Iredell"
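The problem is that the data is malformed. Instead of the name as I see it in the rendered HTML table ("James Wilson"), it appears twice: once as "Lastname, Firstname" and then again as "Firstname Lastname".
The reason is that each <td> actually contains an invisible <span>: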
<td style="text-align:left;" class="">
<span style="display:none" class="">Wilson, James</span>
<a href="/wiki/James_Wilson" title="James Wilson">James Wilson</a>
</td>
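The same is true for the columns with numeric data. I guess this extra markup is needed for sorting the HTML table. However, it is unclear to me how to remove these spans when creating a data.frame from the table in R.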
You can use rvest:
library(rvest)
html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")%>%
html_nodes("span+ a") %>%
html_text()
It isn't perfect, so you may want to refine the CSS selector, but it gets you fairly close.
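One possible refinement, as a rough sketch: restrict the selector to links that directly follow a hidden span inside a table cell, so that unrelated span/link pairs elsewhere on the page are skipped. The `sample_html` string below is a made-up miniature of the page so the snippet runs offline, and it assumes the hidden sort key is marked with an inline `display:none` style and that a current rvest (with `read_html()`) is installed.

```r
library(rvest)

# Hypothetical miniature of the page: one unrelated span/link pair plus a small table
sample_html <- '
<p><span>See also</span><a href="/wiki/Other">another page</a></p>
<table>
  <tr><th>Order</th><th>Name</th></tr>
  <tr><td>1</td><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td>2</td><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>'

doc <- read_html(sample_html)

# The loose selector also picks up the unrelated link in the paragraph
loose <- doc %>% html_nodes("span + a") %>% html_text()

# Tighter: only links that immediately follow a hidden span and are direct children of a cell
names_only <- doc %>%
  html_nodes('td > span[style*="display:none"] + a') %>%
  html_text()

print(loose)
print(names_only)
```

Note that this yields only a character vector of names, not the full table; the remaining columns would still have to be extracted separately.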
Maybe something like this:
library(XML)
library(rvest)
html = html("http://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States")
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "Wilson, JamesJames Wilson" "Jay, JohnJohn Jay†" "Cushing, WilliamWilliam Cushing" "Blair, JohnJohn Blair, Jr."
# [5] "Rutledge, JohnJohn Rutledge" "Iredell, JamesJames Iredell"
removeNodes(getNodeSet(html, "//table/tr/td[2]/span"))
judges = html_table(html_nodes(html, "table")[[2]])
head(judges[,2])
# [1] "James Wilson" "John Jay†" "William Cushing" "John Blair, Jr." "John Rutledge" "James Iredell"
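For anyone on a current rvest release: `html()` has since been replaced by `read_html()`, and parsed documents are xml2 objects, so as far as I can tell the `XML` package functions no longer apply to them. The same idea can be sketched with `xml2::xml_remove()`. This is a minimal sketch, not a drop-in replacement: `sample_html` is a made-up miniature of the table so the snippet runs offline, and the XPath assumes the hidden sort keys carry an inline `display:none` style (the live page may mark them differently).

```r
library(rvest)
library(xml2)

# Hypothetical miniature of the Wikipedia table, inlined so the example runs offline
sample_html <- '
<table>
  <tr><th>Order</th><th>Name</th></tr>
  <tr><td>1</td><td><span style="display:none">Wilson, James</span><a href="/wiki/James_Wilson">James Wilson</a></td></tr>
  <tr><td>2</td><td><span style="display:none">Jay, John</span><a href="/wiki/John_Jay">John Jay</a></td></tr>
</table>'

doc <- read_html(sample_html)

# Locate every hidden span and remove it from the parsed document in place
hidden <- xml_find_all(doc, "//span[contains(@style, 'display:none')]")
xml_remove(hidden)

# With the spans gone, html_table() sees only the visible text of each cell
judges <- html_table(html_node(doc, "table"))
print(judges$Name)
```

Because the spans are removed from every cell rather than just `td[2]`, the numeric columns mentioned in the question are cleaned in the same pass.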