PhantomJS returns 404 in R when attempting web scraping

Question

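I am trying to collect some data from OTC Markets (within the bounds of their robots.txt), but I cannot connect to the webpage.

The first thing I tried was scraping the HTML straight off the page, but the page requires JavaScript to load.
So I downloaded PhantomJS and connected that way. However, this leads to a 404 error page.
I then changed the user agent to something resembling a real user to see if that would let me connect, but still no luck! What is going on here?

Here is a reproducible version of my code; any help would be appreciated. PhantomJS can be downloaded here: http://phantomjs.org/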

library(rvest)
library(xml2)
library(V8)
# example website, I have no affiliation with this stock
url <- 'https://www.otcmarkets.com/stock/YTROF/profile'

# create a JavaScript file that PhantomJS can process
writeLines(sprintf("var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.open('%s', function () {
    console.log(page.content); // page source
    phantom.exit();
});", url), con = "scrape.js")

# phantomjs.exe_PATH stands for the path to the PhantomJS executable
html <- system("phantomjs.exe_PATH scrape.js", intern = TRUE)
# system(intern = TRUE) returns one element per output line, so collapse
# the lines into a single string before handing them to read_html()
page_html <- read_html(paste(html, collapse = "\n"))

Answer 1

I was able to get the HTML content with the following code, which is based on Selenium rather than PhantomJS:

library(RSelenium)
# start a Selenium server with Firefox in Docker (shell() is Windows-only)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate('https://www.otcmarkets.com/stock/YTROF/profile')

# scroll down the page in steps, then pause so it can finish loading
remDr$executeScript("scroll(0, 5000)")
remDr$executeScript("scroll(0, 10000)")
remDr$executeScript("scroll(0, 15000)")
Sys.sleep(4)

# take a screenshot as a visual check, then grab the rendered page source
remDr$screenshot(display = TRUE, useViewer = TRUE)
html_Content <- remDr$getPageSource()[[1]]

It is important to give the page time to load before extracting the HTML content.
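
A fixed Sys.sleep() can wait too long or not long enough. As a rough sketch, one could instead poll a readiness check until it succeeds or a timeout expires. The names wait_until and fake_page_ready are hypothetical and not part of RSelenium; the stand-in condition is only there so the sketch runs without a browser.

```r
# Hypothetical helper (not part of RSelenium): call `is_ready` repeatedly
# until it returns TRUE or `timeout` seconds have elapsed.
wait_until <- function(is_ready, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(is_ready())) return(TRUE)       # condition met
    if (Sys.time() >= deadline) return(FALSE)  # gave up waiting
    Sys.sleep(interval)
  }
}

# Stand-in for a real page check: reports "ready" on the third call.
checks <- 0
fake_page_ready <- function() {
  checks <<- checks + 1
  checks >= 3
}

loaded <- wait_until(fake_page_ready, timeout = 5, interval = 0.1)
if (!loaded) warning("page did not become ready before the timeout")
```

With a live session, the check might be something like function() identical(remDr$executeScript("return document.readyState;")[[1]], "complete"). That is an assumption rather than a tested recipe, and a "complete" readyState does not guarantee that JavaScript-rendered content has appeared, so testing for a specific element on the page may be more reliable.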

Here is another approach based on RDCOMClient:

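library(RDCOMClient)
url <- 'https://www.otcmarkets.com/stock/YTROF/profile'
# drive Internet Explorer through its COM interface (Windows only)
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)

# give the page time to load before touching the document
Sys.sleep(5)
doc <- IEApp$Document()

Sys.sleep(5)
# innerText returns the visible text of the page rather than its markup
html_Content <- doc$documentElement()$innerText()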
