Nested For-Loop Failed To Store Data From Previous Iteration
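I'm new to web scraping and only learned about it last night.
Intro:
I am trying to scrape a Science Direct web page while logged into my account.
I want to store the titles from every iteration (there are 3 pages, hence 3 iterations). Within each iteration I use a second for-loop to read the 25 unique IDs, one for each title on the page.
However, only the titles from the last iteration (page 3) end up stored.
I know the code works when I scrape a single page, but not when I try to scrape the 'Next' pages using the outer for-loop: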
'''
for (i in seq (from = 0, to = 50, by = 25)) {
'''
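As I said above, the code only stores the last iteration (that is, page 3, which contains 25 articles).
By the way, each page has an option for how many articles to display per page: 25, 50, or 100. I chose 25, hence the step of 25 in the sequence.
The code is as follows: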
#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management
titleNo = c()
name = list()
for(i in seq(from = 0, to = 50, by = 25)) {
link = paste0("https://www.sciencedirect.com/search?qs=PISA%2C%20Programme%20for%20International%20Student%20Assessment&date=2010-2021&articleTypes=FLA&lastSelectedFacet=subjectAreas&subjectAreas=3300%2C3200%2C2800%2C2000%2C1200%2C1700%2C1400%2C1800%2C2200&offset=",i,"")
for(j in 1:26) {
page = read_html(link)
titleNo[j] = (paste0(".push-m:nth-child(",j,") h2"))
name[j] <- list(page %>% html_nodes(titleNo[j])%>% html_text())
}
print(paste(i))
}
name <- data.frame(unlist(name))
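Can you point out what I am doing wrong?
The code runs through all the pages successfully, but on each iteration it clears the name variable and stores the new values, so only the last iteration remains.
I think the problem is in my for-loop. I'm not sure whether I'm doing it correctly.
Thanks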
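I think you are overcomplicating this. You can extract all 25 titles of a page in one go using an appropriate CSS selector.
You can then unlist the result to get it as one combined vector.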
library(rvest)
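# Offsets of the result pages to fetch (25 results per page); start from 0 to include the first page
values <- seq(from = 25, to = 50, by = 25)
# paste0() is vectorised, so this builds one URL per offset
link <- paste0("https://www.sciencedirect.com/search?qs=PISA%2C%20Programme%20for%20International%20Student%20Assessment&date=2010-2021&articleTypes=FLA&lastSelectedFacet=subjectAreas&subjectAreas=3300%2C3200%2C2800%2C2000%2C1200%2C1700%2C1400%2C1800%2C2200&offset=", values)
# Read each page once and pull every title on it with a single selector
result <- lapply(link, function(x) x %>%
  read_html() %>%
  html_nodes('div.result-item-content h2 span a') %>%
  html_text())
# Flatten the per-page list into one character vector
titles <- unlist(result)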
titles
#[1] "Computer-generated log-file analyses as a window into students' minds? A showcase study based on the PISA 2012 assessment of problem solving"
#[2] "The Comparison between Successful and Unsuccessful Countries in PISA, 2009"
#[3] "Educational Data Mining: Identification of factors associated with school effectiveness in PISA assessment"
#[4] "Curriculum standardization, stratification, and students’ STEM-related occupational expectations: Evidence from PISA 2006"
#[5] "Testing measurement invariance of PISA 2015 mathematics, science, and ICT scales using the alignment method"
#[6] "Effects of students’ and schools’ characteristics on mathematics achievement: findings from PISA 2006"
#...
#...
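As for why the original loop kept only the last page: the inner index `j` restarts at 1 on every page, so `name[j]` writes into the same slots each time and page 3 overwrites pages 1 and 2. The sketch below reproduces this without any network access. `fake_titles()` is a hypothetical stand-in for the `read_html()` / `html_nodes()` / `html_text()` chain, and it assumes exactly 25 titles per page (the question loops over `1:26`; 25 is used here for simplicity). It also uses the question's offsets (0, 25, 50), whereas `values` above starts at 25.

```r
# Hypothetical stand-in for the scraping step: returns 25 fake titles
# for a given offset, so the sketch runs without network access.
fake_titles <- function(offset) paste0("title_", offset + 1:25)

offsets <- seq(from = 0, to = 50, by = 25)

# Pattern from the question: j restarts at 1 on every page, so each
# page overwrites the slots filled by the previous page.
overwritten <- list()
for (i in offsets) {
  page_titles <- fake_titles(i)  # fetch once per page, not once per title
  for (j in 1:25) {
    overwritten[j] <- list(page_titles[j])
  }
}
length(unlist(overwritten))  # 25: only the last page survives

# One possible fix that keeps both loops: shift the index by the offset.
kept <- list()
for (i in offsets) {
  page_titles <- fake_titles(i)
  for (j in 1:25) {
    kept[i + j] <- list(page_titles[j])
  }
}

# The lapply() approach: one list element per page, then flatten.
per_page <- lapply(offsets, fake_titles)
all_titles <- unlist(per_page)

identical(unlist(kept), all_titles)  # TRUE: both keep all 75 titles
```

The `lapply()` form avoids manual index bookkeeping altogether, which is why it is less error-prone than the nested loops. Note also that the question calls `read_html(link)` inside the inner loop, which downloads the same page once per title; reading it once per page is enough.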