rvest not capturing the entire table
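Hi, I'm trying to scrape a table with 100 rows, but rvest only seems to capture the first 20 rows and then stops. Interestingly, it captures the first column for the whole table, but after row 20 the remaining columns are NA.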
library(rvest)
library(xml2)

# Read the page and collect every <table> node
html <- read_html("https://coinmarketcap.com/historical/20150621/")
tables <- html_nodes(html, "table")

# Parse the third table and keep its first 10 columns
df <- as.data.frame(html_table(tables[[3]], fill = TRUE))
df <- df[, 1:10]
df[1:25, ]
This is what the table looks like:
> df[1:25, ]
Rank Name Symbol Market Cap Price Circulating Supply Volume (24h) % 1h % 24h % 7d
1 1 BTCBitcoin BTC ,488,111,052.52 3.94 14,298,800 BTC ,600,886.00 -0.09% -0.39% 4.33%
2 2 XRPXRP XRP 9,106,281.79 $0.01031 31,908,551,587 XRP * 4,946.56 0.68% -6.49% 26.52%
3 3 LTCLitecoin LTC 1,255,276.52 .02 40,119,404 LTC ,196,087.25 0.66% -0.02% 50.72%
4 4 DOGEDogecoin DOGE ,882,626.13 $0.0002091 99,890,370,337 DOGE 5,750.50 0.33% -0.46% 25.29%
5 5 BTSBitShares BTS ,410,447.59 $0.007727 2,511,953,117 BTS * ,206.36 -1.53% -3.65% 12.20%
6 6 XLMStellar XLM ,058,468.94 $0.003526 4,837,354,256 XLM * ,278.98 -2.85% -4.09% 8.34%
7 7 DASHDash DASH ,581,959.93 .84 5,482,231 DASH ,407.43 -0.17% -1.17% 1.37%
8 8 NXTNxt NXT ,625,080.25 $0.01363 999,997,096 NXT * ,074.26 0.99% -3.74% 15.89%
9 9 BANXBanx BANX ,648,845.01 .64 5,894,665 BANX * ,804.05 -0.11% -0.41% 4.33%
10 10 PPCPeercoin PPC ,857,457.26 $0.3949 22,428,765 PPC ,627.21 -0.46% -5.40% 21.14%
11 11 MAIDMaidSafeCoin MAID ,112,629.90 $0.01793 452,552,412 MAID * ,125.53 -0.65% -0.56% 7.06%
12 12 NMCNamecoin NMC ,681,492.39 $0.4815 11,800,400 NMC ,962.83 -0.99% -4.69% 43.39%
13 13 BCNBytecoin BCN ,086,827.18 $0.00002924 173,955,598,772 BCN ,500.92 0.93% 2.81% 2.53%
14 14 XMRMonero XMR ,286,720.12 $0.5233 8,192,114 XMR ,025.62 -1.03% -2.23% 5.73%
15 15 BLKBlackCoin BLK ,932,944.75 $0.05248 74,938,648 BLK * 2,834.00 1.26% -3.55% 42.16%
16 16 XCPCounterparty XCP ,358,114.93 .27 2,640,365 XCP * ,235.02 -0.09% 3.81% -6.94%
17 17 VTCVertcoin VTC ,264,822.95 $0.2048 15,941,100 VTC ,518.47 -2.41% 2.72% 32.79%
18 18 YBCYbCoin YBC ,161,465.76 .05 3,000,000 YBC * ,359.75 0.11% 2.52% 15.12%
19 19 MONAMonaCoin MONA ,993,610.25 $0.1452 20,619,400 MONA ,199.22 -0.88% 3.92% -8.74%
20 20 UNITYSuperNET UNITY ,675,341.46 .28 816,061 UNITY * 4.62 2.47% -3.88% 16.08%
21 NA BitcoinDark
22 NA NuShares
23 NA Primecoin
24 NA Infinitecoin
25 NA Startcoin
Does anyone know what's going on here?
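The problem here is that the page uses JavaScript to add rows to the table as you scroll down, so when you use read_html you only get the HTML the server sent, before any of that JavaScript has run. The rows added later are not fully there yet.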
The first 200 rows of data are included in the page source, in JSON format, inside this tag:
<script id="__NEXT_DATA__" type="application/json">
...json here...
</script>
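To see the idea without touching the network, here is a minimal offline sketch. The HTML string and its field names (`props`, `rows`, `rank`, `name`) are made up for illustration and are not the real page's structure. It only shows that the rendered table and the embedded JSON can hold different numbers of rows.

```r
library(rvest)
library(jsonlite)

# Hypothetical page: the table has one rendered row,
# while the embedded JSON carries three records
page <- '<html><body>
<table>
<tr><th>Rank</th><th>Name</th></tr>
<tr><td>1</td><td>Coin A</td></tr>
</table>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"rows": [
  {"rank": 1, "name": "Coin A"},
  {"rank": 2, "name": "Coin B"},
  {"rank": 3, "name": "Coin C"}
]}}
</script>
</body></html>'

html <- read_html(page)

# What html_table() sees: only the rows present in the static HTML
static_rows <- html_table(html_node(html, "table"))

# The script tag's text is plain JSON, so it can be parsed directly
embedded <- fromJSON(html_text(html_node(html, "#__NEXT_DATA__")))
all_rows <- embedded$props$rows

nrow(static_rows)
nrow(all_rows)
```

Because `fromJSON()` simplifies an array of records into a data frame by default, `all_rows` can be used like any other data frame.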
You can retrieve the data from that script tag as a data frame like this:
library(rvest)
library(jsonlite)

# Pull the JSON text out of the script tag and parse it
json_data <- read_html("https://coinmarketcap.com/historical/20150621/") %>%
  html_node("#__NEXT_DATA__") %>%
  html_text() %>%
  fromJSON()

# The listing sits a few levels down in the parsed structure
df_data <- json_data$props$initialState$cryptocurrency$listingHistorical$data
dim(df_data)
#> [1] 200 16
However, that data frame has nested columns that you will have to deal with.
Otherwise, you will need to look at something like RSelenium to scrape the dynamic content.
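For the nested columns, one option is `jsonlite::flatten()`, which expands data-frame columns into ordinary columns with dotted names. The sketch below uses made-up records. The `quote`/`USD`/`price` layout is only an assumption about what a nested listing might look like, so check `str(df_data)` for the real names before relying on it.

```r
library(jsonlite)

# Hypothetical nested records; field names and values are invented for illustration
json <- '[
  {"name": "Coin A", "symbol": "AAA",
   "quote": {"USD": {"price": 1.5, "volume_24h": 100}}},
  {"name": "Coin B", "symbol": "BBB",
   "quote": {"USD": {"price": 0.25, "volume_24h": 40}}}
]'

# fromJSON() turns the nested objects into a data-frame column
nested <- fromJSON(json)

# Namespace the call explicitly: purrr also exports a flatten(),
# which would mask this one if the tidyverse is attached
flat <- jsonlite::flatten(nested)

names(nested)
names(flat)
```

This only unpacks columns that are themselves data frames. List columns (for example, an array of tags per row) would still need separate handling.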