Scraping JavaScript in R with RSelenium
I am trying to scrape the Washington Post's database on police shootings. Since it's not plain HTML I can't use rvest, so instead I used RSelenium and PhantomJS.
library(RSelenium)
checkForServer()
startServer()
eCap <- list(phantomjs.binary.path = "C:/Program Files/Chrome Driver/phantomjs.exe")
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
remDr$open()
remDr$navigate("http://www.washingtonpost.com/graphics/national/police-shootings/")
After inspecting the source, it is clear that the items I'm interested in have the following id and class:
<div id="js-list-690" class="listWrapper cf">
I can access the text of an individual item:
remDr$findElement("css", "#js-list-691")$getElementText()
returns
[[1]]
[1] "An unidentified person, a 47-year-old Hispanic man, was shocked with a stun gun and shot on July 30, 2015, in Whittier, Calif. Los Angeles County deputies were investigating a domestic disturbance when he threatened the officers and struck one of them with a metal rod.\nMALEDEADLY WEAPONHISPANIC45 TO 54\nCBS Los AngelesWhittier Daily News"
But if I want to get a list of all of these items:
remDr$findElements("class name", "listWrapper cf")
results in an error.
How do I
- get a list of all the elements that share the listWrapper cf class?
- return a list of the text associated with each element?
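On the RSelenium question itself: the "class name" locator expects a single class, so a value containing a space such as listWrapper cf is most likely being rejected as a compound class name. A CSS selector that chains both classes is one way around this. The sketch below assumes the remDr session opened earlier; get_texts is a hypothetical helper name, not part of RSelenium.

```r
# Hypothetical helper: collect the visible text of every element matching a
# CSS selector. `remDr` is assumed to be an open RSelenium remoteDriver.
get_texts <- function(remDr, selector = ".listWrapper.cf") {
  # ".listWrapper.cf" matches elements that carry both classes
  elems <- remDr$findElements(using = "css selector", value = selector)
  # getElementText() returns a list of length one, so unwrap it
  vapply(elems, function(e) e$getElementText()[[1]], character(1))
}

# Usage, once the page has finished rendering:
# texts <- get_texts(remDr)
# length(texts)
```

Because the page is built by JavaScript, the list may be empty if the call runs before the items have been rendered; a short wait before calling it may be needed.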
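It would be far easier to use the JSON data directly (use "Developer Tools" in almost any modern browser to watch the URLs being loaded; it did not take long to find this one in that list):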
library(jsonlite)
url <- "https://js.washingtonpost.com/graphics/policeshootings/policeshootings.json?d14385542"
shootings <- fromJSON(url)
dplyr::glimpse(shootings)
## Observations: 564
## Variables:
## $ id (int) 3, 4, 5, 8, 9, 11, 13, 15, 16, 17, 19, 21, ...
## $ date (chr) "2015-01-02", "2015-01-02", "2015-01-03", "...
## $ description (chr) "Elliot, who was on medication for depressi...
## $ blurb (chr) "a 53-year-old man of Asian heritage armed ...
## $ name (chr) "Tim Elliot", "Lewis Lee Lembke", "John Pau...
## $ age (int) 53, 47, 23, 32, 39, 18, 22, 35, 34, 47, 25,...
## $ gender (chr) "M", "M", "M", "M", "M", "M", "M", "M", "F"...
## $ race (chr) "A", "W", "H", "W", "H", "W", "H", "W", "W"...
## $ armed (chr) "gun", "gun", "unarmed", "toy weapon", "nai...
## $ city (chr) "Shelton", "Aloha", "Wichita", "San Francis...
## $ state (chr) "WA", "OR", "KS", "CA", "CO", "OK", "AZ", "...
## $ address (chr) "600 block of E. Island Lake Drive", "4519 ...
## $ lat (dbl) 47.24683, 45.48620, 37.69477, 37.76291, 40....
## $ lon (dbl) -123.12159, -122.89128, -97.28055, -122.422...
## $ is_geocoding_exact (lgl) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
## $ mental (lgl) TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FAL...
## $ sources (list) http://kbkw.com/local-news/329755, http://...
## $ photos (list) NULL, NULL, 107, , , , //img.washingtonpos...
## $ videos (list) NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
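Several columns in that result (sources, photos, videos) are list columns, which is how fromJSON represents nested arrays of varying length. Below is a minimal sketch of handling one of them. It uses a made-up two-record JSON string rather than the real feed, so every value in it is a placeholder, and n_sources and long are illustrative names.

```r
library(jsonlite)

# Toy stand-in for the real feed: two records with made-up values
toy_json <- '[
  {"id": 1, "state": "WA", "sources": ["http://example.com/a"]},
  {"id": 2, "state": "OR", "sources": ["http://example.com/b", "http://example.com/c"]}
]'

toy <- fromJSON(toy_json)

# Nested arrays of differing lengths come back as a list column;
# lengths() counts the entries in each row
toy$n_sources <- lengths(toy$sources)

# One row per (id, source) pair, which is easier to filter or join
long <- data.frame(
  id     = rep(toy$id, toy$n_sources),
  source = unlist(toy$sources),
  stringsAsFactors = FALSE
)
```

The same pattern should carry over to shootings$sources, assuming the feed keeps the structure shown in the glimpse() output.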