R XML href scrape from SEC Edgar web site
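I checked previous similar questions - no luck... I can't seem to get readHTMLTable to read an Edgar web page. I am trying to read this URL:
...and put all of the href links under the "Documents" button into a character vector.
The "Documents" link sits inside a table - the first "Documents" href link, as shown by the Firefox inspect tool, looks like this: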
<div id="seriesDiv" style="margin-top: 0px;">
<table class="tableFile2" summary="Results">
<tbody>
<tr></tr>
<tr>
<td nowrap="nowrap"></td>
<td nowrap="nowrap">
<a id="documentsbutton" href="/Archives/edgar/data/320193/000119312516559625/0001193125-16-559625-index.htm">
Documents
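So I'd like to get the href links into a character vector for later use.
Problem - the XML library is giving me trouble, and the htmltab library functions for some reason don't seem to be recognized in my instance of R.
Here's my code: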
library(XML)
EDGARURL <- "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100"
EDGARHREFtables <- readHTMLTable(EDGARURL, as.data.frame = TRUE)
This produces the following warning:
Warning message:
XML content does not seem to be XML: 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100'
What am I missing? Will the XML library's readHTMLTable handle this? If so, how do you extract the href attribute for each document?
For simple jobs, the rvest package is a lot easier:
library(rvest)
url <- 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL&type=10-Q&dateb=&owner=exclude&count=100'
# pull HTML from page
url %>% read_html() %>%
# get tags with a certain CSS selector
html_nodes('#documentsbutton') %>%
# get the href attribute from each node
html_attr('href')
# [1] "/Archives/edgar/data/320193/000119312516559625/0001193125-16-559625-index.htm"
# [2] "/Archives/edgar/data/320193/000119312516439878/0001193125-16-439878-index.htm"
# [3] "/Archives/edgar/data/320193/000119312515259935/0001193125-15-259935-index.htm"
# [4] "/Archives/edgar/data/320193/000119312515153166/0001193125-15-153166-index.htm"
# [5] "/Archives/edgar/data/320193/000119312515023697/0001193125-15-023697-index.htm"
# [6] "/Archives/edgar/data/320193/000119312514277160/0001193125-14-277160-index.htm"
# [7] "/Archives/edgar/data/320193/000119312514157311/0001193125-14-157311-index.htm"
# [8] "/Archives/edgar/data/320193/000119312514024487/0001193125-14-024487-index.htm"
# [9] "/Archives/edgar/data/320193/000119312513300670/0001193125-13-300670-index.htm"
# [10] "/Archives/edgar/data/320193/000119312513168288/0001193125-13-168288-index.htm"
# ...
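The hrefs returned above are site-relative paths, so they need the host prepended before they can be fetched later. A minimal base-R sketch, assuming the host is the same https://www.sec.gov used in the query URL; the names `hrefs`, `base_url` and `doc_urls` are illustrative and not part of the original answer:

```r
# Two of the relative paths from the output above (illustrative sample).
hrefs <- c(
  "/Archives/edgar/data/320193/000119312516559625/0001193125-16-559625-index.htm",
  "/Archives/edgar/data/320193/000119312516439878/0001193125-16-439878-index.htm"
)

# Assumed host, taken from the query URL used above.
base_url <- "https://www.sec.gov"

# paste0() is vectorised, so every path gets the host prepended in one call.
doc_urls <- paste0(base_url, hrefs)

print(doc_urls)
```

In practice the result of the rvest pipeline can be assigned to `hrefs` directly. `xml2::url_absolute()` should do the same job more generally if the page ever mixes relative and absolute links.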