rvest follow_link brings me back to the same page
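I'm trying to scrape text from a news site. The search leads me to a paginated sequence of results, which I normally handle with rvest's follow_link. In this case, however, I keep landing on page 1 instead of page 2, page 3, and so on.
Any idea why this happens?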
library(tidyverse)
library(rvest)
library(httr)
url = "https://www.milenio.com"
UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
MySession = html_session(
  url = url,
  user_agent(UserAgent)
)
page = MySession %>%
  jump_to(url = 'buscador/page/2?text=violencia')
page
page2 = page %>%
  follow_link(css = ".number-pages-container span:nth-child(2) a")
page2
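I added some extra headers and followed this order: search page > page with the query string > link to page 2. I did this because I assumed the cookies need to be set in a particular order.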
library(tidyverse)
library(rvest)
library(httr)
url = "https://www.milenio.com"
# Start the session on the search page itself, with the extra headers
MySession <- session(
  url = 'https://www.milenio.com/buscador',
  add_headers(
    'accept-language' = 'en-GB,en-US;q=0.9,en;q=0.8',
    'user-agent' = "Mozilla/5.0",
    'referer' = 'https://www.milenio.com/buscador'
  )
)
# Then request the search results with the query string
page <- MySession %>%
  session_jump_to(url = '/buscador?text=violencia')
# Finally follow the pagination link to page 2
page2 <- page %>%
  session_follow_link(css = ".number-pages-container span:nth-child(2) a")
# Extract the .headline-number text from the returned page
page2 %>% html_element('.headline-number') %>% html_text()
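If you still end up on the same page, it may help to check which href the CSS selector actually matches before following it. Below is a minimal offline sketch of that check. The pagination markup is hypothetical, invented only to illustrate the selector; the real milenio.com markup may differ, so run the same `html_element()` / `html_attr()` calls on your own `page` object.

```r
library(rvest)

# Hypothetical pagination markup (an assumption, not the real site's HTML)
pager <- minimal_html('
  <div class="number-pages-container">
    <span><a href="/buscador?text=violencia">1</a></span>
    <span><a href="/buscador/page/2?text=violencia">2</a></span>
    <span><a href="/buscador/page/3?text=violencia">3</a></span>
  </div>
')

css <- ".number-pages-container span:nth-child(2) a"

# The href of the first link matched by the selector, i.e. the link
# that following the selector would request
target <- pager %>% html_element(css) %>% html_attr("href")
target
```

If `target` turns out to point at the first page (or is `NA`), the selector is the problem rather than the cookies or headers.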