Reading PhantomJS-rendered HTML into R
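Question: Using rvest, I can't seem to find the block of information I need in an HTML page that I rendered with PhantomJS. I have tried nearly every format I can think of, but I can't get html_node to select the right block.
The HTML rendered by PhantomJS: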
<div class="page">
<div class="main-header">
</script>
<div id="listing-703036966" class="shop-srp-listings__listing">
<div class="card listing-row--search hide-fade">
<div class="listing-row__main">
<div class="listing-row__image">
<div class="media-count shadowed">
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
25 Photos
</a>
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--video" data-goto-vdp="703036966" data-standard-link="md-thumb">
1 Video
</a>
</div>
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="gray-bg listing-row__photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
<img alt="New 2018 BMW 750 i" src="https://www.cstatic-images.com/phototab/e/1/4/e2/f87fb57ec51cab4f57cbaeb9f9f.jpg" onload="window.performance.mark('serverSideFirstPhotoLoaded')">
</a>
<div class="compare-srp">
<div class="listing-row__save">
<a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle="{"listingId":703036966,"mkId":20005,"mkNm":"BMW","mdId":20536,"mdNm":"750","trimId":25905,"trimName":"i","modelYearId":35797618,"modelYear":2018,"stkTyp":"New","state":"NC","zipcode":"27107"}" cars-common-omniture-custom="" omniture-events="">
<div class="save-icon-wrapper">
<div class="cui-icon icon-heart-line">
<svg width="16" height="16" class="icon-image">
<use xlink:href="#cui-icon-heart-outline"></use>
</svg>
</div>
<div class="cui-icon icon-heart">
<svg width="16" height="16" class="icon-image">
<use xlink:href="#cui-icon-heart-fill"></use>
</svg>
</div>
</div>
<p class="saved-label">Save</p>
</a>
</div>
<div class="compare-button" data-compare-listing="703036966">
<div class="compare-icon-wrapper">
<div class="cui-icon icon-plus-sign">
<svg width="16" height="16" class="icon-plus-sign">
<use xlink:href="#cui-icon-plus-sign"></use>
</svg>
</div>
<div class="cui-icon icon-checkmark">
<svg width="16" height="16" class="icon-checkmark">
<use xlink:href="#cui-icon-checkmark"></use>
</svg>
</div>
</div>
<p class="compare-button__label compare">Compare</p>
<p class="compare-button__label added">Added</p>
</div>
</div>
</div>
etc.
What I did in R:
library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
library(tidyverse)
cars <- read_html("my file.html") %>%
html_nodes("div") %>%
html_text()
However, when I inspect the cars vector, the block I need is missing entirely, namely:
<a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle=". {"listingId":703036966,"mkId":20005,"mkNm":"BMW","mdId":20536,"mdNm":"750","trimId":25905,"trimName":"i","modelYearId":35797618,"modelYear":2018,"stkTyp":"New","state":"NC","zipcode":"27107"}" cars-common-omniture-custom="" omniture-events="">
It never gets converted into a usable form, and every node type I have tried (div, p, span) misses it.
Any ideas?
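Answer: It looks like you want to parse the content between the braces from a single node, i.e. the string "vehicle='{"listingId":703036966,..." from the node with the CSS path "a id.703036966 saveVehicleHeart".
Because this node contains no text to be rendered in a browser, html_text() will not help: it returns only a node's text content, and the data you want is stored in an attribute. Instead, you can store the node's markup as a string and then parse the part you are interested in.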
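To illustrate the difference between a node's text and its attributes, here is a minimal sketch. The HTML snippet is a cut-down stand-in for the rendered page, not the real file, and reading the attribute with html_attr() is an alternative to the string-based approach described in this answer, not part of it.

```r
library(rvest)

# Hypothetical, shortened stand-in for the rendered page. The attribute
# value is single-quoted so the JSON's double quotes stay intact.
snippet <- '<div class="listing-row__save">
  <a id="703036966" class="switch-favorite saveVehicleHeart"
     vehicle=\'{"listingId":703036966,"mkNm":"BMW","mdNm":"750"}\'>
    <span class="saved-label">Save</span>
  </a>
</div>'

page <- read_html(snippet)
node <- html_node(page, ".saveVehicleHeart")

# html_text() returns only the text inside the node ...
node_text <- trimws(html_text(node))

# ... whereas html_attr() returns the raw value of a named attribute.
vehicle_json <- html_attr(node, "vehicle")

print(node_text)
print(vehicle_json)
```

If the real page stores the data the same way, html_attr(node, "vehicle") would return the JSON string directly, with no regular expression needed.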
1. Retrieve the node as a string. One of several possible CSS paths to the node is '.saveVehicleHeart'.
library(rvest)
library(stringr)
library(dplyr)
car_html <- read_html("my file.html")
cars <- as.character(html_node(car_html, css = '.saveVehicleHeart'))
2. Extract the content between the braces "{}".
cars <- cars %>%
  str_match(., "\\{.*?\\}") %>% ## Extract everything between the first "{" and the subsequent "}"
  gsub("\\{|\\}", "", .) ## Remove the characters "{" and "}"
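As a side note on the extraction step, the match and the brace removal can be combined by using a capture group. The sketch below runs on a shortened, hypothetical node string rather than the real page. Note that the non-greedy pattern stops at the first "}", so it assumes the JSON contains no nested objects.

```r
library(stringr)

# Hypothetical, shortened serialization of the node for illustration.
node_string <- "<a id=\"703036966\" vehicle='{\"listingId\":703036966,\"mkNm\":\"BMW\"}' omniture-events=\"\"></a>"

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group (the text between the braces).
inner <- str_match(node_string, "\\{(.*?)\\}")[, 2]

print(inner)
```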
3. Bonus: put it into a tidy data frame. You didn't ask for this, but it may be helpful.
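library(tidyr) ## Provides separate()
df_cars <- cars %>%
  cbind(read.table(text = ., sep = ",")) %>% ## Split the string into one column per key-value pair
  t() %>%
  as_data_frame() %>% ## Deprecated in newer tibble versions in favor of as_tibble()
  .[-1,] %>% ## The first row contains the original unparsed string. We drop it.
  separate(., V1, into = c("Variable", "Value"), sep = ":")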
df_cars
# A tibble: 12 × 2
Variable Value
* <chr> <chr>
1 listingId 703036966
2 mkId 20005
3 mkNm BMW
4 mdId 20536
5 mdNm 750
6 trimId 25905
7 trimName i
8 modelYearId 35797618
9 modelYear 2018
10 stkTyp New
11 state NC
12 zipcode 27107
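As an alternative to splitting the string by hand, the content of the vehicle attribute is JSON, so a JSON parser can build the same table. This is a minimal sketch that assumes the jsonlite package is installed and that the full JSON string, braces included, is already available. The string below is a shortened copy of the one in the question.

```r
library(jsonlite)

# Shortened copy of the attribute value from the question, braces included.
vehicle_json <- '{"listingId":703036966,"mkId":20005,"mkNm":"BMW","mdNm":"750","modelYear":2018,"state":"NC","zipcode":"27107"}'

# fromJSON() turns a flat JSON object into a named list.
vehicle <- fromJSON(vehicle_json)

# One row per key, with every value converted to character.
df_vehicle <- data.frame(
  Variable = names(vehicle),
  Value = vapply(vehicle, as.character, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(df_vehicle)
```

Parsing the JSON also copes with values that contain commas or colons, which the read.table() and separate() approach would split incorrectly.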