Web-scraping from a website that does not change URL
I'm quite new to web scraping and I'm having some trouble scraping this site. I basically want to collect pesticide names and active ingredients, but the URL never changes and I can't find a way to click through the grid. Any help?
library(RSelenium)
library(rvest)
library(tidyverse)

# Start a Selenium session in Firefox and open the label database page
rD <- rsDriver(browser = "firefox", port = 4547L, verbose = F)
remDr <- rD[["client"]]
remDr$navigate("http://www.cdms.net/Label-Database")
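For reference, this is roughly the click-and-parse route I was attempting, assuming a placeholder CSS selector for one of the grid tiles (I don't know the real class names, which is part of the problem):

# Placeholder sketch: click a tile in the grid and parse the rendered HTML
# with rvest. ".grid-tile" is a hypothetical selector, not the real one.
tile <- remDr$findElement(using = "css selector", value = ".grid-tile")
tile$clickElement()
Sys.sleep(2)  # give the page time to render after the click

page <- read_html(remDr$getPageSource()[[1]])
page %>% html_elements("a") %>% html_text()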
This site calls an API to get the manufacturer list: http://www.cdms.net/labelssds/Home/ManList?Keys=
On the product page, it then uses another API that takes the manufacturer ID, for example: http://www.cdms.net/labelssds/Home/ProductList?manId=537
You just need to iterate over the Lst array and append the results to a data frame.
For example, the following code gets all products of the first 5 manufacturers:
library(httr)

# Fetch the full manufacturer list from the ManList endpoint
manufacturers <- content(
  GET("http://www.cdms.net/labelssds/Home/ManList?Keys="),
  as = "parsed", type = "application/json"
)

maxManufacturer <- 5   # only scrape the first 5 manufacturers
index <- 1
manufacturerCount <- 0
data <- list()

for (m in manufacturers$Lst) {
  print(m$label)

  # Build the ProductList URL for this manufacturer's ID
  productUrl <- modify_url(
    "http://www.cdms.net/labelssds/Home/ProductList",
    query = list("manId" = m$value)
  )
  products <- content(GET(productUrl), as = "parsed", type = "application/json")

  # Collect every product record for this manufacturer
  for (p in products$Lst) {
    data[[index]] <- p
    index <- index + 1
  }

  manufacturerCount <- manufacturerCount + 1
  if (manufacturerCount == maxManufacturer) {
    break
  }
  Sys.sleep(0.500)  # add delay for scraping
}

df <- do.call(rbind, data)
options(width = 1200)
print(df)
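Since each element of data is a named list parsed from JSON, do.call(rbind, data) gives you a matrix of lists. If you would rather have a regular data frame, one option is the sketch below; the column names simply come from whatever fields the ProductList API returns, and JSON nulls are converted to NA:

library(dplyr)

# Convert each product record to a one-row data frame, then stack them
df_tbl <- bind_rows(lapply(data, function(p) {
  p[sapply(p, is.null)] <- NA          # JSON nulls -> NA so every row has the same shape
  as.data.frame(p, stringsAsFactors = FALSE)
}))
print(df_tbl)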