How to scrape all the hotel reviews from HolidayIQ using Rvest and phantomJS
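I want to scrape all the user reviews from this hotel main page using the rvest package in R.
I can only retrieve the first 10 reviews. The next set of reviews is loaded by clicking a 'View more' button, which is generated by JavaScript.
I wrote the following JavaScript, 'basic.js':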
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'taj.html';
page.open('http://www.holidayiq.com/Taj-Exotica-Benaulim-hotel-2025.html', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
Then I ran the following command in R:
system("./phantomjs basic.js")
The output file 'taj.html' does not contain all the reviews, so the scraping code...
library(rvest)
pg <- read_html("taj.html")
pg %>% html_nodes(".detail-review-by-hotel .srm") %>% html_node(".media-heading") %>% html_text()
...returns only the first 10 reviews.
Using RSelenium:
library(RSelenium)
checkForServer() #just the first time
startServer(invisible = FALSE, log = FALSE)
remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4444
, browserName = "chrome"
)
remDr$open()
Navigate to your page:
remDr$navigate("http://www.holidayiq.com/Taj-Exotica-Benaulim-hotel-2025.html")
Click the "View more" button for as long as there is something to click (stop the execution manually when finished):
while(TRUE){
webElem <- remDr$findElement(using = 'css selector', "#loadMoreTextReview")
remDr$mouseMoveToLocation(webElement = webElem) # move mouse to the element we selected
remDr$click(1) # 2 indicates click the right mouse button
}
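If you would rather not stop the loop by hand, it can be made to end on its own. The following is only a sketch: `click_until_gone`, `max_clicks` and `pause` are names I made up for illustration, and it assumes the button either disappears from the page or is hidden once every review has loaded, which I have not verified for this site. It uses the element's own `clickElement()` instead of moving the mouse.

```r
# Hypothetical helper: click the "View more" button until it can no longer be
# found or is no longer displayed, then return the number of clicks made.
click_until_gone <- function(remDr, selector = "#loadMoreTextReview",
                             max_clicks = 500, pause = 2) {
  clicks <- 0
  while (clicks < max_clicks) {
    # findElement() raises an error when nothing matches; treat that as "done"
    webElem <- tryCatch(
      remDr$findElement(using = "css selector", selector),
      error = function(e) NULL
    )
    if (is.null(webElem)) break

    # Assumption: the site may hide the button rather than remove it
    displayed <- tryCatch(
      isTRUE(webElem$isElementDisplayed()[[1]]),
      error = function(e) FALSE
    )
    if (!displayed) break

    webElem$clickElement()
    clicks <- clicks + 1
    Sys.sleep(pause)  # give the next batch of reviews time to load
  }
  clicks
}
```

With an open session you would call `click_until_gone(remDr)` right after `remDr$navigate(...)`. The `max_clicks` cap is there so the loop cannot run forever if the button never goes away.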
Use CSS selectors (the syntax is similar to rvest's) to scrape everything you need:
namesNodes <- remDr$findElements(using = 'css selector', "#result-items .media-heading")
names<-unlist(lapply(namesNodes, function(x){x$getElementText()}))
firstCommentNodes <- remDr$findElements(using = 'css selector', ".featured-blog-clicked") # the second element is the css selector
firstComment<-unlist(lapply(firstCommentNodes, function(x){x$getElementText()}))
reviewNodes <- remDr$findElements(using = 'css selector', ".detail-posted-txt p") # the second element is the css selector
review<-unlist(lapply(reviewNodes, function(x){x$getElementText()}))
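Since the selector syntax is the same, another option is to hand the fully loaded page back to rvest and parse it there. This is a rough sketch rather than a tested solution for this site: `extract_reviews` is a hypothetical helper name, and the selectors are simply the ones used above.

```r
library(rvest)

# Hypothetical helper: pull reviewer headings and review text out of an HTML string
extract_reviews <- function(html) {
  pg <- read_html(html)
  list(
    names  = pg %>% html_nodes("#result-items .media-heading") %>% html_text(trim = TRUE),
    review = pg %>% html_nodes(".detail-posted-txt p") %>% html_text(trim = TRUE)
  )
}

# Usage with the live session (assumes remDr from above is still open):
# reviews <- extract_reviews(remDr$getPageSource()[[1]])
```

Parsing once from the page source avoids one round trip to the browser per element, which may be noticeably faster when there are many reviews.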
I recommend reading the SelectorGadget vignette to learn how to select CSS paths -> ftp://cran.r-project.org/pub/R/web/packages/rvest/vignettes/selectorgadget.html