Scraping multiple tables from a web page in R

Question

I am trying to pull mutual fund data into R. My approach works when a page contains a single table, but it fails when the page contains multiple tables.

Link - https://in.finance.yahoo.com/q/pm?s=115748.BO

My code:

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)

But I get an error message:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’ In addition: Warning message: XML content does not seem to be XML: 'https://in.finance.yahoo.com/q/pm?s=115748.BO'

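My questions are:

1. How do I extract a specific table from this web page?
2. How do I pull all the tables from this web page?
3. When there are multiple links, how do I pull the tables from each of the web pages?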

Ahttps://in.finance.yahoo.com/q/pm?s=115748.BO

Ahttps://in.finance.yahoo.com/q/pm?s=115749.BO

Ahttps://in.finance.yahoo.com/q/pm?s=115750.BO

Please remove the leading "A" from each link before using it.

Answer 1

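Base R is not able to access https, so readHTMLTable() cannot fetch this URL directly. You can download the page with a package like RCurl instead. The headers on the tables are actually separate tables, and the page is composed of 30+ tables. The data you want is most likely in the tables with class = yfnc_datamodoutline1: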

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
library(RCurl)
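# Download the raw HTML over https, then parse it from the string
appData <- getURL(url, ssl.verifypeer = FALSE)
doc <- htmlParse(appData)
# Select every table node whose class is "yfnc_datamodoutline1"
tabs <- getNodeSet(doc, '//table[@class="yfnc_datamodoutline1"]')
# Convert the first matching node to a data frame
perftable <- readHTMLTable(tabs[[1]], stringsAsFactors = FALSE)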
> perftable
                                     V1      V2
1            Morningstar Return Rating:    2.00
2                  Year-to-Date Return:   2.77%
3                5-Year Average Return:   9.76%
4                   Number of Years Up:       4
5                 Number of Years Down:       1
6  Best 1 Yr Total Return (2014-12-31):  37.05%
7 Worst 1 Yr Total Return (2011-12-31): -27.26%
8         Best 3-Yr Total Return (N/A):  23.11%
9        Worst 3-Yr Total Return (N/A):  -0.33%
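For the "all tables" and "specific table by position" parts of the question, here is a minimal sketch using the same XML package. It parses a small made-up HTML string rather than the live Yahoo page, so the markup and the values in `html_text` are assumptions for illustration only. The real page nests its tables, so the positions will differ there.

```r
library(XML)

# Hypothetical stand-in for the downloaded page (not the real Yahoo markup)
html_text <- '<html><body>
<table class="yfnc_datamodoutline1">
  <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
  <tr><td>Number of Years Up:</td><td>4</td></tr>
</table>
<table class="other">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
</body></html>'

# Parse from a string instead of a URL
doc <- htmlParse(html_text, asText = TRUE)

# Without `which`, readHTMLTable() returns a list with one data frame per <table>
all_tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# With `which`, it returns only the table at that position
second_table <- readHTMLTable(doc, header = FALSE, which = 2,
                              stringsAsFactors = FALSE)
```

Selecting by position is fragile if the page layout changes, which is why the XPath class filter above is usually the safer choice.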

Answer 2

Here is an rvest version that adds the ability to extract a specific table from each fund page:

library(rvest)
library(dplyr)

pages <- c("https://in.finance.yahoo.com/q/pm?s=115748.BO", 
           "https://in.finance.yahoo.com/q/pm?s=115749.BO",
           "https://in.finance.yahoo.com/q/pm?s=115750.BO")


extract_tab <- function(sources, tab_idx) {

  data <- lapply(sources, function(x) {

    pg <- read_html(x)  # html() was replaced by read_html() in later rvest releases
    pg %>% html_nodes(xpath="//table[@class='yfnc_datamodoutline1']//table") -> tabs
    html_table(tabs[[tab_idx]])

  })

  names(data) <- gsub("pm\\?s=", "", basename(sources))  # "\\?" matches a literal "?"

  data

}

extract_tab(pages, 1)

## $`115748.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.76%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.26%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.33%
## 
## $`115749.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.77%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.22%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.30%
## 
## $`115750.BO`
##                               X1    X2
## 1     Morningstar Return Rating:      
## 2           Year-to-Date Return: 1.95%
## 3         5-Year Average Return: 8.92%
## 4            Number of Years Up:      
## 5          Number of Years Down:      
## 6     Best 1 Yr Total Return ():   N/A
## 7    Worst 1 Yr Total Return ():   N/A
## 8  Best 3-Yr Total Return (N/A):   N/A
## 9 Worst 3-Yr Total Return (N/A):   N/A
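
The links in the question carry a stray leading "A", and extract_tab() returns one data frame per fund. As a rough base-R sketch, the prefix can be stripped before scraping and the per-fund results stacked into a single data frame afterwards. The `raw_links` and `results` objects below are hypothetical stand-ins. `results` only mimics the shape of what extract_tab(pages, 1) returns, so no network access is needed.

```r
# Links as posted in the question, each with a stray leading "A"
raw_links <- c("Ahttps://in.finance.yahoo.com/q/pm?s=115748.BO",
               "Ahttps://in.finance.yahoo.com/q/pm?s=115749.BO",
               "Ahttps://in.finance.yahoo.com/q/pm?s=115750.BO")

# Drop only the leading "A"; the rest of each URL is left untouched
pages <- sub("^A", "", raw_links)

# Hypothetical stand-in for the named list returned by extract_tab(pages, 1)
results <- list(
  `115748.BO` = data.frame(X1 = "Year-to-Date Return:", X2 = "2.77%",
                           stringsAsFactors = FALSE),
  `115749.BO` = data.frame(X1 = "Year-to-Date Return:", X2 = "2.77%",
                           stringsAsFactors = FALSE)
)

# Add a `fund` column to each table, then stack them row-wise
tagged <- Map(function(tab, id) {
  data.frame(fund = id, tab, stringsAsFactors = FALSE)
}, results, names(results))
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
```

Keeping the fund identifier as a column makes it easier to filter or reshape the combined table later than working with a list keyed by name.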
