How to extract all visible text from a webpage using R
I need the visible text of this page: https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred/
At first I thought RSelenium would work, but I couldn't figure out how to get all of the visible text.
library("RSelenium")
library("rvest")
remDr <- remoteDriver(port = 4445L)
remDr$open()
remDr$navigate("https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred")
remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# or
remDr$findElement(using='css selector',"body")$getElementText()
Next, I learned about getURLContent:
library("RCurl")
library("XML")
url <- "https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred"
x <- getURLContent(url)
x
But when I tried it, I got this message:
[1] "Found. Redirecting to /ca/en/credit-cards/simply-cash-preferred/"
attr(,"Content-Type")
             charset
"text/plain" "utf-8"
I'm not sure how to get the content of this particular page using getURLContent.
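The "Found. Redirecting to ..." text looks like the body of an HTTP redirect response: the URL was requested without its trailing slash and the server pointed to the version with one. A minimal sketch, assuming that is what happened, is to ask libcurl to follow redirects. The wrapper name `fetch_following_redirects` is hypothetical; `followlocation` is the RCurl spelling of libcurl's CURLOPT_FOLLOWLOCATION option.

```r
library(RCurl)

# Assumption: the slash-less URL is answered with a redirect, as the
# "Found. Redirecting to ..." body suggests.
url <- "https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred"

# Hypothetical wrapper: followlocation = TRUE is passed through to libcurl
# so that the redirect is followed instead of being returned as text.
fetch_following_redirects <- function(url) {
  getURLContent(url, followlocation = TRUE)
}

# Needs network access, so run it interactively:
# x <- fetch_following_redirects(url)
# substr(x, 1, 200)
```

Requesting the URL with the trailing slash in the first place may avoid the redirect altogether. Either way, the response is only the static HTML the server sends; text that the page builds with JavaScript would not be in it, and the site may treat non-browser clients differently.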
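Because the page relies heavily on JavaScript, it helps to combine RSelenium, rvest, and htm2txt. The htm2txt::htm2txt() function handles (that is, parses or strips) many of the JavaScript-formatted fragments that are hard to exclude with plain rvest.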
library(RSelenium)
library(rvest)
library(htm2txt)
library(tidyverse)
# Start a Selenium server with a Firefox session and take the client handle
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = TRUE)
remDr <- rD[["client"]]
remDr$navigate("https://www.americanexpress.com/ca/en/credit-cards/simply-cash-preferred")
# Take the rendered page source, keep only <body>, and convert its HTML to plain text
captured_text <-
  remDr$getPageSource()[[1]] %>%
  read_html(encoding = "UTF-8") %>%
  html_node(xpath = "//body") %>%
  as.character() %>%
  htm2txt::htm2txt()
captured_text
[1] "Skip to content\n\nMenuMenu\n\nThe following navigation element is controlled via arrow keys followed by tab\n\nMy Account\nMy Account\n\nPersonal Accounts\n\n• Account Summary\n\n• View Statement\n\n• Manage Account\n\n• Make a Payment\n\n• Manage Pre-Authorized Payment\n\n• Add Someone to Your Account\n\nBusiness Accounts\n\n• Business Account Summary\n\n• American Express @Work\n\n• Merchant Services\n\nOnline Services\n\n• Register for Online Services\n\n• Activate Your Card\n\n• American Express App\n\n• Manage Account Alerts\n\n• Sign Up for Email Offers\n\n• Online-Only Statements\n\nHelp & Support\n\n• Forgot User ID or Password?\n\n• Support 24/7\n\n• Welcome Centre\n\n• Ways to Pay\n\n• Security Centre\n\nCanadaChange Country\n\nEnglish\n\n• Français\n\nCards\nCards\n\nPersonal Cards\n\n• View All Cards\n\n• Cash Back Credit Cards\n\n• Flexible Rewards Cards\n\n• No Annual Fee Cards\n\n• Co-Branded Cards\n\n• Travel Cards\n\nFeatured Cards\n\n• The American Express Aeroplan Reserve Card\n\n• The Cobalt Card\n\n• The SimplyCash Preferred Card\n\n• The Choice Card..."
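If htm2txt is not available, one possible alternative (not the method used above) is to remove the nodes that never render as text and then extract the rest with rvest. The sketch below runs on a made-up HTML string standing in for `remDr$getPageSource()[[1]]`; the helper name `extract_body_text` and the sample markup are hypothetical.

```r
library(xml2)
library(rvest)

# Hypothetical helper (not from the answer above): drop nodes that never
# render as visible text, then return the text of what is left in <body>.
extract_body_text <- function(html) {
  doc <- read_html(html)
  # script, style and noscript content is not shown to the reader
  hidden <- xml_find_all(doc, "//script | //style | //noscript")
  xml_remove(hidden)
  html_text2(html_element(doc, "body"))
}

# Made-up sample standing in for the page source returned by Selenium
sample_html <- paste0(
  "<html><body><h1>Card benefits</h1>",
  "<script>var tracking = 1;</script>",
  "<style>p { color: red; }</style>",
  "<p>Earn <b>cash back</b> on purchases.</p>",
  "</body></html>"
)

body_text <- extract_body_text(sample_html)
cat(body_text, "\n")
```

This only strips elements by tag name. Elements hidden through CSS (for example `display: none`) would still be included, because deciding what is actually visible needs a browser; for that, the `getElementText()` call on the body element from the question is the closer fit.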