Extract text from body of HTML page with RSelenium
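I need to extract text from a number of web pages that are rendered with JavaScript.
The code below usually works for me and yields only the text and line breaks, which is what I want.
On some pages, however, it fails.
How can I use RSelenium to extract the body text of the page marked "URL Fails"?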
library("tidyverse")
library("rvest")
library("RSelenium")
remDr <- remoteDriver(port = 4445L)
remDr$open()
# URL Works
url <- "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/"
# URL Fails
# url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
pg <-
  remDr$getPageSource()[[1]] %>%
  read_html(encoding = "UTF-8") %>%
  html_node(xpath = "//body") %>%
  as.character() %>%
  htm2txt::htm2txt()
remDr$close()
Solution proposed by @NadPat:
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
My result:
Selenium message:a is null
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: 'fe72a1de69e7', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-84-generic', java.version: '1.8.0_91'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
Something is still being read for the failing URL, because
remDr$getPageSource()[[1]]
returns:
[1] "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><script>\n\nsitePrefix = 'BMO';\nvar pageNameMapping = {};\n\n//channelDemo\npageNameMapping[\"atm_en\"]=\"channelDemo\";\npageNameMapping[\"atm_fr\"]=\"channelDemo\";\n\n//Every Day Banking\npageNameMapping[\"Personal\"]=\"PERS\";\npageNameMapping[\"Bank Accounts\"]=\"Bank-Accounts\";\npageNameMapping[\"Daily savings account\"]=\"Premium-Rate-Savings\";\npageNameMapping[\"High Interest Savings Account\"]=\"Smart-Saver\";\npageNameMapping[\"Chequing account\"]=\"Primary-Chequing\";\npageNameMapping[\"Business Premium Rate Savings\"]=\"Business Premium Rate Account\";\n\n//Cards\npageNameMapping[\"Credit Cards\"]=\"CC\";\n\n\n//Mortgages\npageNameMapping[\"Mortgages\"]=\"MTG\";\npageNameMapping[\"Special Offers\"]=\"Special-Offers\";\n\n//Wealth Management\npageNameMapping[\"Wealth Management\"]=\"Wealth\";\npageNameMapping[\"AdviceDirect\"]=\"Advicedirect\";\n\n//Online Investing\npageNameMapping[\"Online Investing\"]=\"ONL-INVS\";\npageNameMapping...
Is there something wrong with the way I set up RSelenium with Docker?
=======================
Update:
I pulled the latest standalone-firefox image with Docker, and @NadPat's solution now works for me.
docker pull selenium/standalone-firefox:latest
Starting the browser:
library(RSelenium)
driver <- rsDriver(
  port = 4841L,
  browser = "firefox")
remDr <- driver[["client"]]
url <- "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
First approach:
remDr$navigate(url)
text <- remDr$findElement(using = 'xpath', value = '/html')
text$getElementText()
[[1]]
[1] "Skip navigation\nPersonal\nPrivate Wealth\nBusiness\nCommercial\nCapital Markets\nSearch\nFind us\nSupport\nEN\nLogin\nBank Accounts\nCredit Cards\nMortgages\nLoans & Lines of Credit\nInvestments\nFinancial Planning\nInsurance\nWays to Bank\nAbout BMO\nPersonal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional groce
Second approach:
text <- remDr$findElement(using = 'xpath', value = '//*[@id="main"]')
text$getElementText()
[[1]]
[1] "Personal\nCredit Cards\nBMO CashBack Mastercard\nBMO CashBack® Mastercard®*\nEnjoy the most cash back on groceries in Canada without paying an annual fee\nfootnote\n*\nFootnote\n* Based on a comparison of the non-promotional grocery rewards earn rate on cash back credit cards with no annual fee as of June 1, 2021.\nWelcome offer\nGet up to 5% cash back in your first 3 months‡‡ and a 1.99% introductory interest rate on balance transfers for 9 months with a 1% transfer fee.§§\nAPPL
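Since the goal is to process many pages, the `findElement()` / `getElementText()` calls used above could be wrapped in a small helper. This is only a minimal sketch and not part of the original solution: `get_page_text`, `text_to_lines` and `wait_secs` are hypothetical names, the fixed `Sys.sleep()` pause is an assumption about how long the JavaScript needs to render, and the `tryCatch()` simply returns `NA` when Selenium raises an error such as the `UnknownError` shown earlier, so one bad page does not stop the whole run.

```r
# Hypothetical helper: navigate to a URL and return the visible text of the
# element matched by `xpath`, or NA_character_ if Selenium raises an error.
# `remDr` is assumed to be an open RSelenium client (e.g. driver[["client"]]).
get_page_text <- function(remDr, url, xpath = "/html", wait_secs = 2) {
  tryCatch({
    remDr$navigate(url)
    Sys.sleep(wait_secs)  # crude pause so JavaScript can render (assumption)
    elem <- remDr$findElement(using = "xpath", value = xpath)
    elem$getElementText()[[1]]
  }, error = function(e) {
    message("Failed for ", url, ": ", conditionMessage(e))
    NA_character_
  })
}

# Hypothetical helper: split the returned text into trimmed, non-empty lines.
text_to_lines <- function(txt) {
  if (is.na(txt)) return(character(0))
  lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
  lines[nzchar(lines)]
}

# Usage, assuming `remDr` was created as shown above:
# urls <- c(
#   "https://www.td.com/ca/en/personal-banking/products/credit-cards/travel-rewards/rewards-visa-card/",
#   "https://www.bmo.com/main/personal/credit-cards/bmo-cashback-mastercard/"
# )
# pages <- lapply(urls, function(u) text_to_lines(get_page_text(remDr, u)))
```

The default XPath `/html` mirrors the first approach; passing `xpath = '//*[@id="main"]'` mirrors the second, but only works on pages that actually have an element with that id.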