Scraping multiple tables out of a webpage in R
I am trying to pull mutual fund data into R. My code works for a single table, but it fails when the webpage contains multiple tables.
Link - https://in.finance.yahoo.com/q/pm?s=115748.BO
My code
url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)
But I get an error message.
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: 'https://in.finance.yahoo.com/q/pm?s=115748.BO'
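My questions are
- How do I extract a specific table from this webpage?
- How do I pull all the tables from this webpage?
- When there are multiple links, what is a simple way to extract a specific table from each webpage?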
Ahttps://in.finance.yahoo.com/q/pm?s=115748.BO
Ahttps://in.finance.yahoo.com/q/pm?s=115749.BO
Ahttps://in.finance.yahoo.com/q/pm?s=115750.BO
Remove the "A" from the links when using them.
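Base R cannot access https. You can use a package like RCurl to download the page first. The headers on the tables are actually separate tables, and the page as a whole is made up of 30+ tables. The data you want is most likely in the table with class = yfnc_datamodoutline1: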
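url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
library(RCurl)
# Download the page over https; note that ssl.verifypeer = FALSE skips certificate checks
appData <- getURL(url, ssl.verifypeer = FALSE)
# Parse the downloaded HTML text into a document
doc <- htmlParse(appData)
# Select every table node carrying the target class
appData <- doc['//table[@class="yfnc_datamodoutline1"]']
# Convert the first matching table to a data frame
perftable <- readHTMLTable(appData[[1]], stringsAsFactors = FALSE)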
> perftable
V1 V2
1 Morningstar Return Rating: 2.00
2 Year-to-Date Return: 2.77%
3 5-Year Average Return: 9.76%
4 Number of Years Up: 4
5 Number of Years Down: 1
6 Best 1 Yr Total Return (2014-12-31): 37.05%
7 Worst 1 Yr Total Return (2011-12-31): -27.26%
8 Best 3-Yr Total Return (N/A): 23.11%
9 Worst 3-Yr Total Return (N/A): -0.33%
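The same calls also cover the "pull all tables" question. The sketch below is only illustrative: `page_html` is a hypothetical inline stand-in for the downloaded page (the real page has a different, nested layout), and `header = FALSE` is passed explicitly on the assumption that the tables have no header row.

```r
library(XML)

# Hypothetical stand-in for a downloaded page: two sibling tables.
page_html <- '
<html><body>
<table class="yfnc_datamodoutline1">
  <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
  <tr><td>5-Year Average Return:</td><td>9.76%</td></tr>
</table>
<table class="other">
  <tr><td>Number of Years Up:</td><td>4</td></tr>
</table>
</body></html>'

# Parse the HTML from a string instead of a URL.
doc <- htmlParse(page_html, asText = TRUE)

# Passing the whole document returns a list with one data frame per table.
all_tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# Passing a single node returns just that table.
nodes <- getNodeSet(doc, '//table[@class="yfnc_datamodoutline1"]')
one_table <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
```

With a real page, `appData` from `getURL()` would take the place of `page_html`.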
Here's an rvest version, with added functionality for extracting a specific table from each fund page:
library(rvest)
library(dplyr)
pages <- c("https://in.finance.yahoo.com/q/pm?s=115748.BO",
"https://in.finance.yahoo.com/q/pm?s=115749.BO",
"https://in.finance.yahoo.com/q/pm?s=115750.BO")
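extract_tab <- function(sources, tab_idx) {
  data <- lapply(sources, function(x) {
    # read_html() replaces the deprecated html()
    pg <- read_html(x)
    # tables nested inside the outline table hold the actual data
    tabs <- pg %>% html_nodes(xpath="//table[@class='yfnc_datamodoutline1']//table")
    html_table(tabs[[tab_idx]])
  })
  # name each result after its fund code, e.g. "115748.BO"
  names(data) <- gsub("pm\\?s=", "", basename(sources))
  data
}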
extract_tab(pages, 1)
## $`115748.BO`
## X1 X2
## 1 Morningstar Return Rating: 2.00
## 2 Year-to-Date Return: 2.77%
## 3 5-Year Average Return: 9.76%
## 4 Number of Years Up: 4
## 5 Number of Years Down: 1
## 6 Best 1 Yr Total Return (2014-12-31): 37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.26%
## 8 Best 3-Yr Total Return (N/A): 23.11%
## 9 Worst 3-Yr Total Return (N/A): -0.33%
##
## $`115749.BO`
## X1 X2
## 1 Morningstar Return Rating: 2.00
## 2 Year-to-Date Return: 2.77%
## 3 5-Year Average Return: 9.77%
## 4 Number of Years Up: 4
## 5 Number of Years Down: 1
## 6 Best 1 Yr Total Return (2014-12-31): 37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.22%
## 8 Best 3-Yr Total Return (N/A): 23.11%
## 9 Worst 3-Yr Total Return (N/A): -0.30%
##
## $`115750.BO`
## X1 X2
## 1 Morningstar Return Rating:
## 2 Year-to-Date Return: 1.95%
## 3 5-Year Average Return: 8.92%
## 4 Number of Years Up:
## 5 Number of Years Down:
## 6 Best 1 Yr Total Return (): N/A
## 7 Worst 1 Yr Total Return (): N/A
## 8 Best 3-Yr Total Return (N/A): N/A
## 9 Worst 3-Yr Total Return (N/A): N/A
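If one data frame is more convenient than a named list, the result can be stacked with `dplyr::bind_rows()`. This is a minimal sketch, not part of the original answer: `fund_tables` is a hand-built stand-in for the list returned by `extract_tab()`, using a few rows copied from the output above, and the `fund` and `value` column names are assumptions chosen for illustration.

```r
library(dplyr)

# Stand-in for the list returned by extract_tab(); values copied from the output above.
fund_tables <- list(
  `115748.BO` = data.frame(
    X1 = c("Year-to-Date Return:", "5-Year Average Return:"),
    X2 = c("2.77%", "9.76%"),
    stringsAsFactors = FALSE
  ),
  `115750.BO` = data.frame(
    X1 = c("Year-to-Date Return:", "5-Year Average Return:"),
    X2 = c("1.95%", "8.92%"),
    stringsAsFactors = FALSE
  )
)

# Stack the per-fund tables; the list names become a "fund" column.
combined <- bind_rows(fund_tables, .id = "fund")

# Strip the percent sign to get a numeric column for further analysis.
combined$value <- as.numeric(sub("%", "", combined$X2, fixed = TRUE))
```

Cells that are empty or "N/A" on the page, as for 115750.BO, would become `NA` (with a warning) in the numeric conversion.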