Finding alternate versions of a tag when iteratively scraping many links with rvest
I am scraping some data from SEC filings. Each XML document has the following basic format:
<ns1:infoTable>
  <ns1:nameOfIssuer>ACCENTURE PLC IRELAND</ns1:nameOfIssuer>
  <ns1:titleOfClass>SHS CLASS A</ns1:titleOfClass>
  <ns1:cusip>G1151C101</ns1:cusip>
  <ns1:value>47837</ns1:value>
  <ns1:shrsOrPrnAmt>
    <ns1:sshPrnamt>183135</ns1:sshPrnamt>
    <ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
  </ns1:shrsOrPrnAmt>
  <ns1:investmentDiscretion>SOLE</ns1:investmentDiscretion>
  <ns1:votingAuthority>
    <ns1:Sole>0</ns1:Sole>
    <ns1:Shared>0</ns1:Shared>
    <ns1:None>183135</ns1:None>
  </ns1:votingAuthority>
</ns1:infoTable>
However, some documents have this form instead:
<infoTable>
  <nameOfIssuer>2U INC</nameOfIssuer>
  <titleOfClass>COM</titleOfClass>
  <cusip>90214J101</cusip>
  <value>340</value>
  <shrsOrPrnAmt>
    <sshPrnamt>8504</sshPrnamt>
    <sshPrnamtType>SH</sshPrnamtType>
  </shrsOrPrnAmt>
  <investmentDiscretion>SOLE</investmentDiscretion>
  <votingAuthority>
    <Sole>8504</Sole>
    <Shared>0</Shared>
    <None>0</None>
  </votingAuthority>
</infoTable>
So the only difference between the tags is the added "ns1:" prefix.
When scraping, I can find the nodes like this:
urll <- "https://www.sec.gov/Archives/edgar/data/1002152/000108514621000479/infotable.xml"
session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  xml_ns_strip() %>%
  xml_find_all('ns1:infoTable')
Or, for the alternate tag without the ns1: prefix:
urll <- "https://www.sec.gov/Archives/edgar/data/1002672/000106299321000915/form13fInfoTable.xml"
session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  xml_ns_strip() %>%
  xml_find_all('infoTable')
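But when iterating over many links, I don't know in advance which XML file will use which tag. Is there a way to get the nodes, either by specifying both alternatives with an "or" operator or by string-matching the text "infoTable" in the tag name?
I have tried:
session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  xml_ns_strip() %>%
  xml_find_all(xpath = '//*[self::infoTable or self::ns1:infoTable]')
or
session %>%
  nod(urll) %>%
  scrape(verbose = FALSE) %>%
  xml_ns_strip() %>%
  xml_find_all(xpath = "//*[contains(text(),'infoTable')]")
But neither variant works. Any suggestions on how to get this working?
Thanks in advance. I am using polite, rvest, and dplyr.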
Consider using local-name() in your XPath expression. The example below uses httr and the new R 4.1.0+ pipe, |>:
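library(xml2)
library(httr)

url <- "https://www.sec.gov/Archives/edgar/data/1002152/000108514621000479/infotable.xml"

# Fetch and parse the document, then match infoTable elements by local name,
# so it does not matter whether the tag carries the ns1: prefix
info_tables <- httr::GET(url, httr::user_agent("Mozilla/5.0")) |>
  httr::content(encoding = "UTF-8") |>
  xml2::xml_find_all(xpath = "//*[local-name()='infoTable']")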
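To see why this covers both cases without a network call, here is a minimal sketch using two tiny inline documents. The root element name and the namespace URI are placeholders I made up for illustration, not necessarily what the SEC files contain. local-name() compares only the part of the element name after any prefix, so the same expression matches both variants. By contrast, contains(text(), 'infoTable') tests an element's text content rather than its name, which is why that attempt finds nothing.

```r
library(xml2)

# Two inline stand-ins for the two file variants.
# The root element and namespace URI are placeholders (assumptions).
prefixed <- read_xml(
  '<ns1:informationTable xmlns:ns1="http://example.com/ns">
     <ns1:infoTable><ns1:cusip>G1151C101</ns1:cusip></ns1:infoTable>
   </ns1:informationTable>'
)
unprefixed <- read_xml(
  '<informationTable xmlns="http://example.com/ns">
     <infoTable><cusip>90214J101</cusip></infoTable>
   </informationTable>'
)

# One XPath for both: match on the local part of the element name only
xp <- "//*[local-name()='infoTable']"

# Count the matching nodes in each document
counts <- vapply(
  list(prefixed = prefixed, unprefixed = unprefixed),
  function(doc) length(xml_find_all(doc, xp)),
  integer(1)
)
counts
```

Both counts should come out as 1, and no call to xml_ns_strip() is needed because the expression never refers to a namespace prefix.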
Then build the data frame:
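df_list <- lapply(info_tables, function(r) {
  # Direct children of the infoTable node
  vals <- xml2::xml_children(r)
  # Grandchildren, e.g. the fields nested under shrsOrPrnAmt and votingAuthority
  other_vals <- xml2::xml_find_all(r, "*") |>
    xml2::xml_children()

  # One-row data frame of child names and their text
  child_df <- setNames(
    c(xml2::xml_text(vals)),
    c(xml2::xml_name(vals))
  ) |> rbind() |> data.frame()

  # One-row data frame of grandchild names and their text
  grand_df <- setNames(
    c(xml2::xml_text(other_vals)),
    c(xml2::xml_name(other_vals))
  ) |> rbind() |> data.frame()

  cbind.data.frame(child_df, grand_df)
})

# Stack the one-row data frames into the final result
final_df <- do.call(rbind.data.frame, df_list)
final_df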
                   nameOfIssuer titleOfClass     cusip value shrsOrPrnAmt investmentDiscretion votingAuthority sshPrnamt sshPrnamtType Sole Shared   None
1         ACCENTURE PLC IRELAND  SHS CLASS A G1151C101 47837     183135SH                 SOLE        00183135    183135            SH    0      0 183135
2                  ALPHABET INC CAP STK CL A 02079K305 43695      24931SH                 SOLE         0024931     24931            SH    0      0  24931
3                     APPLE INC          COM 037833100  3229      24334SH                 SOLE         0024334     24334            SH    0      0  24334
4    BERKSHIRE HATHAWAY INC DEL         CL A 084670108  2783          8SH                 SOLE             008         8            SH    0      0      8
5           CANADIAN NATL RY CO          COM 136375102   218       1985SH                 SOLE          001985      1985            SH    0      0   1985
6  CHECK POINT SOFTWARE TECH LT          ORD M22465104 45505     342375SH                 SOLE        00342375    342375            SH    0      0 342375
7           CHURCH & DWIGHT INC          COM 171340102 42500     487221SH                 SOLE        00487221    487221            SH    0      0 487221
8  COGNIZANT TECHNOLOGY SOLUTIO         CL A 192446102 46076     562243SH                 SOLE        00562243    562243            SH    0      0 562243
9               CVS HEALTH CORP          COM 126650100 44311     648773SH                 SOLE        00648773    648773            SH    0      0 648773
10          DANAHER CORPORATION          COM 235851102 44200     198974SH                 SOLE        00198974    198974            SH    0      0 198974
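Note that the shrsOrPrnAmt and votingAuthority columns above hold the concatenated text of their nested fields (for example 183135SH). If you would rather not carry those columns, one possible variation, which is a sketch of my own and not part of the code above, is to collect only the leaf elements of each infoTable. The helper name leaf_row, the root element, and the namespace URI below are hypothetical, and the inline document just stands in for a downloaded file.

```r
library(xml2)

# Inline stand-in for one downloaded file (placeholder root and namespace URI)
doc <- read_xml(
  '<informationTable xmlns="http://example.com/ns">
     <infoTable>
       <nameOfIssuer>2U INC</nameOfIssuer>
       <shrsOrPrnAmt>
         <sshPrnamt>8504</sshPrnamt>
         <sshPrnamtType>SH</sshPrnamtType>
       </shrsOrPrnAmt>
     </infoTable>
   </informationTable>'
)

rows <- xml_find_all(doc, "//*[local-name()='infoTable']")

# Hypothetical helper: build a one-row data frame from leaf elements only.
# ".//*[not(*)]" selects descendant elements that have no element children.
leaf_row <- function(node) {
  leaves <- xml_find_all(node, ".//*[not(*)]")
  as.data.frame(
    as.list(setNames(xml_text(leaves), xml_name(leaves))),
    stringsAsFactors = FALSE
  )
}

# Stack one row per infoTable node
leaf_df <- do.call(rbind, lapply(rows, leaf_row))
leaf_df
```

This assumes every infoTable has the same set of leaf fields; if some files omit optional fields, rbind() will fail and the rows would need to be aligned first.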