In R, a website read with read_html is redirected. How do I get the URL it was redirected to?
this_page = read_html("https://apu.edu/athletics")
> this_page
{xml_document}
<html id="ctl00_html" lang="en" class=" index homepage">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script>window.client_hostname = "athletics.apu.edu";window.server_name = "79077 ...
[2] <body>\n<div style="position: fixed; left: -10000px"><script src="//cdn.blueconic.net/azusa.js" async=""></script></div>\n<script>(function(i,s,o,g,r,a,m){i[ ...
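Although we read https://apu.edu/athletics, it redirects to athletics.apu.edu. This holds in the browser as well, and it can also be seen in the output of this_page: <script>window.client_hostname = "athletics.apu.edu"; ...
Is it possible to extract that value from the this_page variable?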
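As a rough illustration of the literal question, the hostname can be pulled out of the parsed document by searching the text of its <script> nodes. This is only a sketch: the HTML string below is a trimmed stand-in for the real page (the server_name value is a placeholder), and the approach assumes the site keeps emitting a window.client_hostname assignment, which nothing guarantees.

```r
library(xml2)

# Stand-in for the real page: a trimmed copy of the <head> shown above.
# The server_name value is a placeholder, not the site's real value.
html <- paste0(
  '<html><head><script>window.client_hostname = "athletics.apu.edu";',
  'window.server_name = "placeholder";</script></head><body></body></html>'
)
this_page <- read_html(html)

# Collect the text of every <script> node into one string.
script_text <- paste(
  xml_text(xml_find_all(this_page, "//script")),
  collapse = "\n"
)

# Capture the quoted value assigned to window.client_hostname.
hit <- regmatches(
  script_text,
  regexec('window\\.client_hostname\\s*=\\s*"([^"]+)"', script_text, perl = TRUE)
)[[1]]

# hit[1] is the full match, hit[2] the captured hostname; NA if not found.
hostname <- if (length(hit) >= 2) hit[2] else NA_character_
hostname
```

This depends on one site's inline JavaScript, so reading the final URL from the HTTP response is the more general route.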
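Edit: The current top 3 answers (ekoam, David, Allan) all work, and all take the same time (0.35 seconds). I have accepted the trace_redirects answer because it provides additional information about all the redirects...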
If you use html_session() instead, it should work:
library(rvest)
url <- "https://apu.edu/athletics"
s <- html_session(url)
s
#> <session> https://athletics.apu.edu/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 221620
s$url
#> [1] "https://athletics.apu.edu/"
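In rvest 1.0.0 and later, html_session() is deprecated in favour of session(). A minimal sketch of the same idea with the newer function follows; final_url is a hypothetical helper name introduced here for illustration, and the call needs network access.

```r
library(rvest)

# Hypothetical helper: returns the URL a request finally lands on.
final_url <- function(url) {
  # session() is the rvest >= 1.0.0 replacement for html_session()
  s <- session(url)
  # the session records the URL of the last response, after redirects
  s$url
}

# Usage (performs a real HTTP request):
# final_url("https://apu.edu/athletics")
```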
If you don't mind using httr for this, then simply:
httr::GET("https://apu.edu/athletics")[["url"]]
#> [1] "https://athletics.apu.edu/"
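If the parsed page is needed as well, a single request can supply both, since xml2::read_html() accepts an httr response object. The following is a sketch under that assumption; fetch_page is a hypothetical name, not part of either package.

```r
library(httr)
library(xml2)

# Hypothetical helper: one GET request, returning both the final URL
# and the parsed document.
fetch_page <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)        # fail early on HTTP errors
  list(
    url  = resp$url,           # URL after all redirects were followed
    page = read_html(resp)     # parse the body of the same response
  )
}

# Usage (performs a real HTTP request):
# res <- fetch_page("https://apu.edu/athletics")
# res$url
# res$page
```

This avoids fetching the page once with read_html() and a second time just to learn where it ended up.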
If you want to get all the redirects (you are actually redirected twice here), you can use this function:
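library(magrittr)  # provides %>%

trace_redirects <- function(url) {
  # all_headers holds one header set per response in the redirect chain
  httr::GET(url)$all_headers %>%
    lapply(function(x) x$headers$location) %>%
    unlist() %>%
    unique()
}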
So you can do:
trace_redirects("https://apu.edu/athletics")
#> [1] "https://www.apu.edu/athletics" "http://athletics.apu.edu"
#> [3] "https://athletics.apu.edu/"
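To see which status code accompanied each hop, the same all_headers list can be flattened into a data frame. This is a sketch only: redirect_table is a hypothetical helper, and the mock list below uses made-up example.com URLs that merely imitate the shape of httr's all_headers so the code runs without a network connection.

```r
# Hypothetical helper: one row per response in the redirect chain.
redirect_table <- function(all_headers) {
  status <- vapply(all_headers, function(x) as.integer(x$status), integer(1))
  location <- vapply(all_headers, function(x) {
    loc <- x$headers$location
    # the final response normally has no Location header
    if (is.null(loc)) NA_character_ else loc
  }, character(1))
  data.frame(status = status, location = location, stringsAsFactors = FALSE)
}

# Mock data imitating httr's all_headers structure (not real output).
mock_hops <- list(
  list(status = 301L, headers = list(location = "https://www.example.com/a")),
  list(status = 200L, headers = list())
)
hops <- redirect_table(mock_hops)
hops

# With a real request:
# redirect_table(httr::GET("https://apu.edu/athletics")$all_headers)
```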