Web scraping search results in R
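I am new to web scraping, and I am trying to scrape some data produced by the search function within a website. I am using rvest to extract the information, but I am getting no results. This is the website:
This is what I am running: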
URL <- 'https://www.encompassinsurance.com/agency-locator.aspx#PostalCode=21403&City=&StateProvCd=&Latitude=&Longitude='
webpage <- read_html(URL)
name_html <- html_nodes(webpage,'.locator_result_name')
name_data <- html_text(name_html)
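When I run this code, the response I get is:
character(0)
I would expect the response to be the name of each company returned by the postal code search (e.g. "Townley-Kenton Insurance Agency", "Bradford Turner Insurance Group LLC").
I know there is some JavaScript on this page and I may be missing an important piece, but given my limited knowledge of HTML, CSS and JavaScript, I am not sure how to apply V8 or PhantomJS to make this work.
Any help is appreciated.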
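The page does indeed use JavaScript to fetch the data dynamically (via an XHR GET request). However, you can send this request directly from R using the httr package. It returns a JSON string, which is easy to parse with jsonlite.
Almost all of the information you might want to scrape is in the data frame info$OfficeInfo: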
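library(httr)
library(jsonlite)
# Send the same GET request the page issues via XHR and read the body as text
res <- content(GET(paste0("https://alr.encompassinsurance.com/",
                          "?PostalCode=30350&City=&StateProvCd=",
                          "&Latitude=&Longitude=")), "text")
# Parse the JSON string into an R list; OfficeInfo becomes a data frame
info <- fromJSON(res)
info$OfficeInfo$Name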
#> [1] "Townley-Kenton Insurance Agency"
#> [2] "Bradford Turner Insurance Group LLC"
#> [3] "Arthur J Gallagher Risk Management Services, Inc."
#> [4] "Lanigan Insurance Group Inc"
#> [5] "Haven Insurance Group"
#> [6] "The Leavitt Insurance Group of Atlanta, Incorporated"
#> [7] "Findley Insurance Agency Inc"
#> [8] "Grimes Insurance Agency Inc"
#> [9] "Larry L Talbert Ins Agency DBA Talbert Insurance Services"
#> [10] "The Alliance Group, Inc."
#> [11] "Concierge Insurance Group LLC"
#> [12] "Sutter McLellan & Gilbreath Inc"
#> [13] "The Wichalonis Insurance Agency"
#> [14] "The Beck Agency"
#> [15] "USI Insurance Services LLC"
#> [16] "The Insurance Store"
#> [17] "Southern Insurance Associates of Dunwoody"
#> [18] "D.C.J.D. Corporation DBA The Markey Insurance Group"
#> [19] "DM Services, Incorporated"
#> [20] "Southern Insurance Advisors"
#> [21] "Metro Brokers Insurance Services"
#> [22] "1 Source Insurance, LLC"
#> [23] "The Bates Agency II, LLC"
#> [24] "Risk & Insurance Consultants Inc"
#> [25] "Integrity Insurance & Financial Services Inc"
#> [26] "HN Insurance Services Inc"
#> [27] "Norton Metro LLC"
#> [28] "The Nsure Network LLC"
#> [29] "Henssler Norton Insurance LLC"
#> [30] "Brown & Brown Insurance of Georgia"
#> [31] "America Insurance Brokers, Inc. DBA AIB"
#> [32] "Clear View Insurance Agency"
#> [33] "Relation Insurance Services"
#> [34] "Partners Risk Services LLC"
#> [35] "PointeNorth Insurance Group LLC"
#> [36] "Advanced Insurors Inc"
#> [37] "Mcever & Tribble, Inc."
#> [38] "The Bethea Insurance Group, LLC"
#> [39] "Watchko - Young Ins Agcy Inc"
#> [40] "Sterling Seacrest Partners Inc"
#> [41] "Little & Smith, Incorporated"
#> [42] "LMG Insurance Services Inc"
#> [43] "Granite Risk Advisors LLC"
#> [44] "Mountain Lakes Insurance, LLC"
#> [45] "Hutchinson Traylor Insurance"
#> [46] "Edgewood Partners Insurance Center"
#> [47] "ADC Agency"
#> [48] "MLG Insurance & Financial Services"
#> [49] "Burnette Insurance Agency"
#> [50] "Campbell and Company Enterprise, Incorporated"
Created on 2020-08-19 by the reprex package (v0.3.0)
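If you need the names for more than one postal code, the request above could be wrapped in a small function. This is only a sketch: the helper names `parse_agency_names` and `get_agency_names` are hypothetical, the endpoint and parameter names are copied from the request above, and it assumes the service still answers with JSON containing an `OfficeInfo` array, which may no longer be true.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: extract the agency names from the JSON text.
# Assumes the top-level object has an OfficeInfo array of records with a
# Name field, as in the response shown above.
parse_agency_names <- function(json_text) {
  info <- fromJSON(json_text)
  info$OfficeInfo$Name
}

# Hypothetical helper: run the locator search for one postal code.
get_agency_names <- function(postal_code) {
  resp <- GET(
    "https://alr.encompassinsurance.com/",
    # Let httr build and escape the query string. The postal code is passed
    # as a string so that leading zeros are preserved.
    query = list(
      PostalCode  = as.character(postal_code),
      City        = "",
      StateProvCd = "",
      Latitude    = "",
      Longitude   = ""
    )
  )
  stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses
  parse_agency_names(content(resp, "text", encoding = "UTF-8"))
}
```

Calling `get_agency_names("21403")` would then run the search from the question. Keeping the parsing step in its own function means it can be checked against a saved JSON string without touching the network.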