RSelenium and rvest - only getting data for selected check boxes
I can access a web page using the following:
Data/code:
library(RSelenium)
library(rvest)
library(tidyverse)
rD <- rsDriver(browser="firefox", port=4536L)
remDr <- rD[["client"]]
zona_url_to_get = "https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/eixample/l"
remDr$navigate(zona_url_to_get)
# accept cookies
remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
#click on Distrito
remDr$findElement(using = "xpath", '/html/body/div[1]/div[2]/div[1]/div[3]/div/div[1]/div')$clickElement()
html_zona_full_page = remDr$getPageSource()[[1]] %>%
read_html()
This opens the page, accepts the cookies, clicks the drop-down menu and reads the HTML from the page.
I can then use the following:
Zonas_Names = html_zona_full_page %>%
html_nodes('.re-GeographicSearchNext-checkboxItem')
Which gives me:
{xml_nodeset (16)}
[1] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Ciutat Vella" href="/es/c ...
[2] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Eixample" href ...
[3] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Dreta de l'Eixample" href="/es/comprar/viviendas/barcelona-capital ...
[4] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Fort Pienc" href="/es/comprar/viviendas/barcelona-capital/fort-pie ...
[5] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="La Nova Esquerra de l'Eixample" href="/es/comprar/viviendas/barcel ...
[6] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="L'Antiga Esquerra de l'Eixample" href="/es/comprar/viviendas/barce ...
[7] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Sagrada Família" href="/es/comprar/viviendas/barcelona-capital/sag ...
[8] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Sant Antoni" href="/es/comprar/viviendas/barcelona-capital/sant-an ...
[9] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Gràcia" href="/es/comprar ...
[10] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Horta - Guinardó" href="/ ...
[11] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Les Corts" href="/es/comp ...
[12] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Nou Barris" href="/es/com ...
[13] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Sant Andreu" href="/es/co ...
[14] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Sant Martí" href="/es/com ...
[15] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Sants - Montjuïc" href="/ ...
[16] <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Sarrià - Sant Gervasi" hr
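However, I am not interested in all of this information, only in the items that are selected on the web page (the ones with a tick next to them). They correspond to the following: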
<a class="re-GeographicSearchNext-checkboxItem is-checked" title="Dreta de l'Eixample"...
<a class="re-GeographicSearchNext-checkboxItem is-checked" title="Fort Pienc"...
<a class="re-GeographicSearchNext-checkboxItem is-checked" title="La Nova Esquerra de l'Eixample"...
... etc.
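My question is: how can I keep only the ticked items from this list?
I thought the following might work, since it includes the is-checked part, but it returns an xml_nodeset of length 0: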
> html_zona_full_page %>%
+ html_nodes('.re-GeographicSearchNext-checkboxItem is-checked')
{xml_nodeset (0)}
I can run:
html_zona_full_page %>%
html_nodes('.re-GeographicSearchNext-checkboxItem') %>%
html_nodes('.re-GeographicSearchNext-checkboxItem-literal')
Which gives me:
{xml_nodeset (16)}
[1] <span class="re-GeographicSearchNext-checkboxItem-literal">Ciutat Vella</span>
[2] <span class="re-GeographicSearchNext-checkboxItem-literal">Eixample</span>
[3] <span class="re-GeographicSearchNext-checkboxItem-literal">Dreta de l'Eixample</span>
[4] <span class="re-GeographicSearchNext-checkboxItem-literal">Fort Pienc</span>
[5] <span class="re-GeographicSearchNext-checkboxItem-literal">La Nova Esquerra de l'Eixample</span>
[6] <span class="re-GeographicSearchNext-checkboxItem-literal">L'Antiga Esquerra de l'Eixample</span>
[7] <span class="re-GeographicSearchNext-checkboxItem-literal">Sagrada Família</span>
[8] <span class="re-GeographicSearchNext-checkboxItem-literal">Sant Antoni</span>
[9] <span class="re-GeographicSearchNext-checkboxItem-literal">Gràcia</span>
[10] <span class="re-GeographicSearchNext-checkboxItem-literal">Horta - Guinardó</span>
[11] <span class="re-GeographicSearchNext-checkboxItem-literal">Les Corts</span>
[12] <span class="re-GeographicSearchNext-checkboxItem-literal">Nou Barris</span>
[13] <span class="re-GeographicSearchNext-checkboxItem-literal">Sant Andreu</span>
[14] <span class="re-GeographicSearchNext-checkboxItem-literal">Sant Martí</span>
[15] <span class="re-GeographicSearchNext-checkboxItem-literal">Sants - Montjuïc</span>
[16] <span class="re-GeographicSearchNext-checkboxItem-literal">Sarrià - Sant Gervasi</span>
But I am not interested in Ciutat Vella, Gràcia, Horta ... Sarrià - Sant Gervasi, since they are not ticked on the web page.
In the end, I am only interested in:
c("Dreta de l'Eixample", "Fort Pienc", "La Nova Esquerra de l'Eixample", "L'Antiga Esquerra de l'Eixample", "Sagrada Família", "Sant Antoni")
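We can join the two class names with a . (no space), so that the selector matches elements carrying both classes. With a space in between, is-checked is read as a descendant element name rather than a class, which is why that attempt matched nothing: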
Zonas_Names = html_zona_full_page %>%
html_nodes('.re-GeographicSearchNext-checkboxItem.is-checked')
Output:
> Zonas_Names
{xml_nodeset (7)}
[1] <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Eixample" href="/es/comprar/viviendas/barcelona-capi ...
[2] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Dreta de l'Eixample" href="/es/comprar/viviendas/barcelona-capital/dreta-de-l-eixample/l"><div class="su ...
[3] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Fort Pienc" href="/es/comprar/viviendas/barcelona-capital/fort-pienc/l"><div class="sui-MoleculeCheckbox ...
[4] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="La Nova Esquerra de l'Eixample" href="/es/comprar/viviendas/barcelona-capital/la-nova-esquerra-de-l-eixa ...
[5] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="L'Antiga Esquerra de l'Eixample" href="/es/comprar/viviendas/barcelona-capital/l-antiga-esquerra-de-l-ei ...
[6] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Sagrada Família" href="/es/comprar/viviendas/barcelona-capital/sagrada-familia/l"><div class="sui-Molecu ...
[7] <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Sant Antoni" href="/es/comprar/viviendas/barcelona-capital/sant-antoni/l"><div class="sui-MoleculeCheckb ...
These correspond to the checked (clicked) items.
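The result above still contains 7 nodes, because the parent district Eixample is checked as well, while the question asks for the six neighbourhood names only. A minimal sketch of one way to get from the nodes to a character vector follows. It runs on a small hand-written HTML snippet rather than the live page, and it assumes that the --has-separator class marks district-level entries; that is inferred from the printed output, not confirmed by the site. The names html_demo, spaced, checked, is_district and titles are made up for illustration.

```r
library(rvest)

# Hypothetical, trimmed-down copy of the drop-down markup (not the live page)
html_demo <- read_html('
<div>
  <a class="re-GeographicSearchNext-checkboxItem re-GeographicSearchNext-checkboxItem--has-separator" title="Ciutat Vella"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked re-GeographicSearchNext-checkboxItem--has-separator" title="Eixample"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Fort Pienc"></a>
  <a class="re-GeographicSearchNext-checkboxItem is-checked" title="Sant Antoni"></a>
</div>')

# With a space, "is-checked" is treated as a descendant element name: no match
spaced <- html_nodes(html_demo, ".re-GeographicSearchNext-checkboxItem is-checked")
length(spaced)

# With a dot, both classes must be present on the same element
checked <- html_nodes(html_demo, ".re-GeographicSearchNext-checkboxItem.is-checked")

# Assumption: "--has-separator" marks district-level rows such as Eixample
is_district <- grepl("has-separator", html_attr(checked, "class"), fixed = TRUE)

# Keep the neighbourhood rows and read their title attribute
titles <- html_attr(checked[!is_district], "title")
titles
```

On the real page, the same last three statements could be applied to Zonas_Names in place of checked. If the separator class turns out not to be a reliable marker, dropping the known district name from the titles vector would be a simpler fallback.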