Extract text from body of HTML page with RSelenium

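I need to extract text from a number of web pages that are rendered with JavaScript.

The code below usually works for me: it produces only the text and line breaks, which is what I want.

On some pages, however, it does not work.

How can I use RSelenium to extract the body text of the page marked "URL Fails"?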

library("tidyverse")
library("rvest")
library("RSelenium")

remDr <- remoteDriver(port = 4445L)
remDr$open()

# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"

# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"

remDr$navigate(url)

pg <-  
  remDr$getPageSource()[[1]] %>% 
  read_html(encoding = "UTF-8") %>% 
  html_node(xpath = "//body") %>%
  as.character() %>% 
  htm2txt::htm2txt()

remDr$close()

Solution suggested by @NadPat

url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()

My result:

Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown

Error:   Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     class: org.openqa.selenium.WebDriverException
     Further Details: run errorDetails method

Something is being read for the failing URL, because remDr$getPageSource()[[1]] returns:

[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...

Is there something wrong with the way I set up RSelenium with Docker?

=======================

Update: I pulled the latest version of standalone-firefox from Docker, and now @NadPat's solution works for me.

docker pull selenium/standalone-firefox:latest

Launching the browser:

library(RSelenium)
driver <- rsDriver(
  port = 4841L,
  browser = "firefox")

remDr <- driver[["client"]]

url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
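Because the page is rendered by JavaScript, the element you want may not exist yet at the moment navigate() returns. Whether that plays any role here is not established in this thread, but on slow pages a small polling loop can help. A minimal sketch: wait_until and element_present are hypothetical helper names, not part of RSelenium; only findElements() is an RSelenium method, and the 15-second timeout is an arbitrary assumption.

```r
# Hypothetical helper: call `ready()` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Errors inside `ready()` count as "not yet".
wait_until <- function(ready, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(ready()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Hypothetical helper: TRUE if at least one element matches the XPath.
# findElements() returns an empty list when nothing matches.
element_present <- function(remDr, xpath) {
  length(remDr$findElements(using = "xpath", value = xpath)) > 0
}

# Usage with an open session `remDr` (not run here):
# remDr$navigate(url)
# found <- wait_until(function() element_present(remDr, '//*[@id="main"]'),
#                     timeout = 15)
# if (!found) warning("main element did not appear in time")
```

Polling with findElements() rather than findElement() avoids an error on every attempt where the element is still missing.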

First approach:

remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[[1]]
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce
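getElementText() returns a list holding a single string with embedded \n separators, as the output above shows. If you would rather have one entry per line, a small base-R sketch follows. split_lines is a hypothetical helper name, and the sample string is copied from the start of the output above.

```r
# Hypothetical helper: split element text into trimmed, non-empty lines.
split_lines <- function(text) {
  lines <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])
  lines[nzchar(lines)]
}

sample_text <- "Skip navigation\nPersonal\nPrivate Wealth"
split_lines(sample_text)
# [1] "Skip navigation" "Personal"        "Private Wealth"

# With a live session (not run here):
# split_lines(text$getElementText()[[1]])
```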

Second approach:

text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[[1]]
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL
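The two approaches can be combined so that the more specific XPath is tried first and /html serves as a fallback on pages that have no element with id "main". A minimal sketch, assuming an open session remDr as above. first_text is a hypothetical helper, and it relies on findElement() signalling an R error when nothing matches.

```r
# Hypothetical helper: return the text of the first XPath that matches
# and yields non-empty text, or NA if none does.
first_text <- function(remDr, xpaths) {
  for (xp in xpaths) {
    txt <- tryCatch(
      remDr$findElement(using = "xpath", value = xp)$getElementText()[[1]],
      error = function(e) NULL  # no match or a server-side error: try the next XPath
    )
    if (!is.null(txt) && nzchar(txt)) return(txt)
  }
  NA_character_
}

# Usage with an open session `remDr` (not run here):
# remDr$navigate(url)
# page_text <- first_text(remDr, c('//*[@id="main"]', "/html"))
```

Trying the narrower selector first keeps navigation and header text out of the result where possible.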