Using a list of URLs in R, how do I web scrape images, download the files, and group the images back to their original URL?

I have a vector of URLs:

library(rvest)
URLs <- c("https://www.espn.com/f1/story/_/id/31287940/norris-made-step-says-mclaren",
          "https://www.espn.com/f1/story/_/id/31287893/vettel-calls-fia-not-very-professional-imola-penalty",
          "https://www.espn.com/f1/story/_/id/31284743/alonso-promoted-points-finish-raikkonen-penalty")

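I want to loop through these and create a list of image links for all the images on each page, with the landing page as the name of the list element. I would then like to download the images while keeping both the image URL and the landing page URL attached to each one.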

My current code only works for one page at a time.

Reprex for a single URL:

url <- "https://www.espn.com/f1/story/_/id/31287940/norris-made-step-says-mclaren"
webpage <- html_session(url)  # renamed to session() in rvest 1.0.0
link.titles <- webpage %>% html_nodes("img")
img.url <- link.titles[2] %>% html_attr("src")

### Issue #1: I could not figure out how to loop with html_attr() so that it returns all of the image URLs in a list ###
download.file(img.url, "test.jpg", mode = "wb")

### Issue #2: because of this, I cannot loop through a list and download the images with their names ###
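
For Issue #1, a minimal sketch of one possible approach: `html_attr()` is already vectorised over a node set, so no explicit loop over the `img` nodes should be needed. The helper names `get_img_urls` and `collect_images` and the inline HTML snippet are made up for illustration; they are not part of rvest.

```r
library(rvest)

# Return every non-missing <img src> found on an already-parsed page.
get_img_urls <- function(page) {
  src <- page %>% html_nodes("img") %>% html_attr("src")
  src[!is.na(src) & nzchar(src)]
}

# One list element per landing page, named by the landing page URL.
# (Needs network access when it is called on real URLs.)
collect_images <- function(urls) {
  lapply(setNames(urls, urls), function(u) get_img_urls(read_html(u)))
}

# Offline demonstration on a made-up HTML snippet.
demo <- read_html('<html><body>
  <img src="https://example.com/a.jpg">
  <img alt="no source">
  <img src="//example.com/b.png">
</body></html>')
get_img_urls(demo)
# [1] "https://example.com/a.jpg" "//example.com/b.png"
```

Calling `collect_images(URLs)` would then give a list in which `names()` holds the landing pages and each element holds that page's image links, assuming the images are exposed through a plain `src` attribute.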

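Once I am able to download the list of images, the URL problem can easily be solved by naming each file with the landing page and the image URL separated by a delimiter. I can create an ID for each URL to keep the file names short. Naming convention: "LP1-ESPN.com.jpg".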
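
A rough sketch of how such a delimiter-based naming scheme might look. The exact pattern `LP<page id>-IMG<image id>.jpg` and the helper names `make_file_name` and `parse_file_name` are assumptions chosen for illustration; the IDs would be mapped back to the real URLs through a separate lookup table.

```r
# Build "LP<page id>-IMG<image id>.<ext>" names. The "-" delimiter lets both
# ids be recovered from the file name later.
make_file_name <- function(page_id, img_id, ext = "jpg") {
  sprintf("LP%d-IMG%d.%s", page_id, img_id, ext)
}

# Recover the ids from the file names that are still in a folder.
parse_file_name <- function(file_name) {
  stem  <- tools::file_path_sans_ext(basename(file_name))
  parts <- strsplit(stem, "-", fixed = TRUE)
  data.frame(
    page_id = as.integer(sub("^LP",  "", vapply(parts, `[`, character(1), 1))),
    img_id  = as.integer(sub("^IMG", "", vapply(parts, `[`, character(1), 2)))
  )
}

files  <- make_file_name(c(1L, 1L, 2L), c(1L, 2L, 1L))
files
# [1] "LP1-IMG1.jpg" "LP1-IMG2.jpg" "LP2-IMG1.jpg"
parsed <- parse_file_name(files)
table(parsed$page_id)  # photos remaining per landing page
```

After the manual clean-up, `parse_file_name(list.files(folder))` would give the surviving page and image IDs, which can be counted and joined back to the original dataset.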

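The purpose is to look quickly through the photos for each link, delete the irrelevant ones, count the photos per link, and then join the counts (taken after manually deleting the irrelevant photos) and the links back to the original dataset, which has other metrics for analysis. That is why I want the naming convention above: I can load the remaining names and links from the folder in R without manipulating the jpg files.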

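EDIT: I have been able to get a list of all my URLs and the image links within them. I cannot download them with this loop; many are missing. The following code works, but it only downloads about 10% of the images. I know html_session can get around this, but there are about 2,500 image links and I cannot figure out how to loop through sessions. Maybe a while loop?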

tryCatch(lapply(1:length(total_urls.2$V1), function(x) 
  download.file(new_df[[x]],paste0(total_urls.2[x,3],"_", total_urls.2[x,4],".jpeg"),method = "auto" ,mode = "wb", cacheOK = FALSE)), error = function(e) NULL)

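Here is the data.frame in question. It is 2,380 rows long, but it is truncated to 50 rows here. I would like to know how to download all of the image links; right now I only have about 19 images in the folder.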


dput(total_urls.2[1:50,])
structure(list(V1 = c("https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/i/columnists/edmondson_laurence_m.jpg&h=80&w=80&scale=crop", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0418/r842329_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/i/columnists/edmondson_laurence_m.jpg&h=80&w=80&scale=crop", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0418/r842215_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/icons/in_15.png", "https://a.espncdn.com/icons/in_15.png", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0417/r841759_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0331/r834376_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0417/r841759_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0331/r834376_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/i/columnists/edmondson_laurence_m.jpg&h=80&w=80&scale=crop", 
"https://a.espncdn.com/combiner/i?img=/i/columnists/saunders_nate_m.jpg&h=80&w=80&scale=crop", 
"https://a.espncdn.com/icons/in_15.png", "https://a.espncdn.com/icons/in_15.png", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/i/columnists/edmondson_laurence_m.jpg&h=80&w=80&scale=crop", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0329/r833584_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0415/r840887_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/icons/in_15.png", "https://a.espncdn.com/icons/in_15.png", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0415/r840882_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0415/r840887_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/icons/in_15.png", "https://a.espncdn.com/icons/in_15.png", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0413/r839738_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0415/r840882_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://a3.espncdn.com/combiner/i?img=%2Fredesign%2Fassets%2Fimg%2Ficons%2FESPN%2Dicon%2Dnascar.png&w=80&h=80&scale=crop&cquality=40&location=origin", 
"https://a.espncdn.com/combiner/i?img=/photo/2021/0329/r833489_1296x1296_1-1.jpg&w=130&h=130&scale=crop&location=center", 
"https://ca-times.brightspotcdn.com/dims4/default/331b349/2147483647/strip/true/crop/3937x2625+0+0/resize/840x560!/quality/90/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Fb3%2F61%2F0c10c4344ce0a69870cf8864cf98%2Fitaly-emilia-romagna-f1-gp-auto-racing-58053.jpg", 
"//cdn.cnn.com/cnnnext/dam/assets/200809102318-verstappen-celeb-exlarge-169.jpg", 
"//cdn.cnn.com/cnnnext/dam/assets/200809102318-verstappen-celeb-exlarge-169.jpg", 
"//cdn.cnn.com/cnnnext/dam/assets/200809102318-verstappen-celeb-large-169.jpg", 
"//cdn.cnn.com/cnnnext/dam/assets/210309145427-mick-and-michael-schumacher-large-169.jpg", 
"//cdn.cnn.com/cnnnext/dam/assets/210305121558-stephanie-travers-lewis-hamilton-large-169.jpg"
), URLs = c("https://www.espn.com/f1/story/_/id/31289705/the-blame-game-analysis-bottas-russell-clash-happens-next1", 
"https://www.espn.com/f1/story/_/id/31289705/the-blame-game-analysis-bottas-russell-clash-happens-next2", 
"https://www.espn.com/f1/story/_/id/31289705/the-blame-game-analysis-bottas-russell-clash-happens-next3", 
"https://www.espn.com/f1/story/_/id/31287940/norris-made-step-says-mclaren1", 
"https://www.espn.com/f1/story/_/id/31287893/vettel-calls-fia-not-very-professional-imola-penalty1", 
"https://www.espn.com/f1/story/_/id/31284743/alonso-promoted-points-finish-raikkonen-penalty1", 
"https://www.espn.com/f1/story/_/id/31284241/russell-lost-sight-bigger-picture-bottas-clash1", 
"https://www.espn.com/f1/story/_/id/31284241/russell-lost-sight-bigger-picture-bottas-clash2", 
"https://www.espn.com/f1/story/_/id/31284241/russell-lost-sight-bigger-picture-bottas-clash3", 
"https://www.espn.com/f1/story/_/id/31283343/russell-asked-bottas-trying-kill-us-both1", 
"http://www.espn.com/espn/wire?section=rpm&id=312812861", "http://www.espn.com/espn/wire?section=rpm&id=312812862", 
"http://www.espn.com/espn/wire?section=rpm&id=312812863", "https://www.espn.com/f1/story/_/id/31281017/f1-confirms-race-miami-20221", 
"https://www.espn.com/f1/story/_/id/31281017/f1-confirms-race-miami-20222", 
"https://www.espn.com/f1/story/_/id/31281017/f1-confirms-race-miami-20223", 
"https://www.espn.com/f1/story/_/id/31281017/f1-hold-miami-grand-prix-2022-onwards1", 
"https://www.espn.com/f1/story/_/id/31281017/f1-hold-miami-grand-prix-2022-onwards2", 
"https://www.espn.com/f1/story/_/id/31281017/f1-hold-miami-grand-prix-2022-onwards3", 
"https://www.espn.com/f1/story/_/id/31275927/how-hamilton-mercedes-turned-tables-red-bull1", 
"https://www.espn.com/f1/story/_/id/31275927/how-hamilton-mercedes-turned-tables-red-bull2", 
"https://www.espn.com/f1/story/_/id/31275927/how-hamilton-mercedes-turned-tables-red-bull3", 
"http://www.espn.com/espn/wire?section=rpm&id=312742341", "http://www.espn.com/espn/wire?section=rpm&id=312742342", 
"http://www.espn.com/espn/wire?section=rpm&id=312742343", "https://www.espn.com/f1/story/_/id/31270294/has-mercedes-regained-advantage-already-ferrari-spring-surprise1", 
"https://www.espn.com/f1/story/_/id/31270294/has-mercedes-regained-advantage-already-ferrari-spring-surprise2", 
"https://www.espn.com/f1/story/_/id/31270374/aston-martin-wants-aero-rules-revision1", 
"https://www.espn.com/f1/story/_/id/31270374/aston-martin-wants-aero-rules-revision2", 
"https://www.espn.com/f1/story/_/id/31270374/aston-martin-wants-aero-rules-revision3", 
"http://www.espn.com/espn/wire?section=rpm&id=312676841", "http://www.espn.com/espn/wire?section=rpm&id=312676842", 
"http://www.espn.com/espn/wire?section=rpm&id=312676843", "https://www.espn.com/f1/story/_/id/31263611/hamilton-says-f1-rivalry-vettel-remains-favourite1", 
"https://www.espn.com/f1/story/_/id/31263611/hamilton-says-f1-rivalry-vettel-remains-favourite2", 
"https://www.espn.com/f1/story/_/id/31263611/hamilton-says-f1-rivalry-vettel-remains-favourite3", 
"http://www.espn.com/espn/wire?section=rpm&id=312634771", "http://www.espn.com/espn/wire?section=rpm&id=312634772", 
"http://www.espn.com/espn/wire?section=rpm&id=312634773", "https://www.espn.com/f1/story/_/id/31263349/mercedes-gone-hunted-hunters1", 
"https://www.espn.com/f1/story/_/id/31263349/mercedes-gone-hunted-hunters2", 
"https://www.espn.com/f1/story/_/id/31263349/mercedes-gone-hunted-hunters3", 
"https://www.espn.com/f1/story/_/id/31263278/verstappen-calls-clarity-messy-track-limits-rules1", 
"https://www.espn.com/f1/story/_/id/31263278/verstappen-calls-clarity-messy-track-limits-rules2", 
"https://www.latimes.com/sports/story/2021-04-18/max-verstappen-lewis-hamilton-emilia-romagna-grand-prix1", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html1", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html2", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html4", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html6", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html8"
), Article_URL = c("IMESP1", "IMESP2", "IMESP3", "IMESP4", "IMESP5", 
"IMESP6", "IMESP7", "IMESP8", "IMESP9", "IMESP10", "IMESP11", 
"IMESP12", "IMESP13", "IMESP14", "IMESP15", "IMESP16", "IMESP17", 
"IMESP18", "IMESP19", "IMESP20", "IMESP21", "IMESP22", "IMESP23", 
"IMESP24", "IMESP25", "IMESP26", "IMESP27", "IMESP28", "IMESP29", 
"IMESP30", "IMESP31", "IMESP32", "IMESP33", "IMESP34", "IMESP35", 
"IMESP36", "IMESP37", "IMESP38", "IMESP39", "IMESP40", "IMESP41", 
"IMESP42", "IMESP43", "IMESP44", "IMESP45", "IMESP62", "IMESP63", 
"IMESP65", "IMESP67", "IMESP69"), img_URL = c("img21", "img22", 
"img23", "img24", "img25", "img26", "img27", "img28", "img29", 
"img210", "img211", "img212", "img213", "img214", "img215", "img216", 
"img217", "img218", "img219", "img220", "img221", "img222", "img223", 
"img224", "img225", "img226", "img227", "img228", "img229", "img230", 
"img231", "img232", "img233", "img234", "img235", "img236", "img237", 
"img238", "img239", "img240", "img241", "img242", "img243", "img244", 
"img245", "img262", "img263", "img265", "img267", "img269")), row.names = c("https://www.espn.com/f1/story/_/id/31289705/the-blame-game-analysis-bottas-russell-clash-happens-next1", 
"https://www.espn.com/f1/story/_/id/31289705/the-blame-game-analysis-bottas-russell-clash-happens-next2", 
"https://www.espn.com/f1/story/_/id/31289705/the-blame-game-analysis-bottas-russell-clash-happens-next3", 
"https://www.espn.com/f1/story/_/id/31287940/norris-made-step-says-mclaren1", 
"https://www.espn.com/f1/story/_/id/31287893/vettel-calls-fia-not-very-professional-imola-penalty1", 
"https://www.espn.com/f1/story/_/id/31284743/alonso-promoted-points-finish-raikkonen-penalty1", 
"https://www.espn.com/f1/story/_/id/31284241/russell-lost-sight-bigger-picture-bottas-clash1", 
"https://www.espn.com/f1/story/_/id/31284241/russell-lost-sight-bigger-picture-bottas-clash2", 
"https://www.espn.com/f1/story/_/id/31284241/russell-lost-sight-bigger-picture-bottas-clash3", 
"https://www.espn.com/f1/story/_/id/31283343/russell-asked-bottas-trying-kill-us-both1", 
"http://www.espn.com/espn/wire?section=rpm&id=312812861", "http://www.espn.com/espn/wire?section=rpm&id=312812862", 
"http://www.espn.com/espn/wire?section=rpm&id=312812863", "https://www.espn.com/f1/story/_/id/31281017/f1-confirms-race-miami-20221", 
"https://www.espn.com/f1/story/_/id/31281017/f1-confirms-race-miami-20222", 
"https://www.espn.com/f1/story/_/id/31281017/f1-confirms-race-miami-20223", 
"https://www.espn.com/f1/story/_/id/31281017/f1-hold-miami-grand-prix-2022-onwards1", 
"https://www.espn.com/f1/story/_/id/31281017/f1-hold-miami-grand-prix-2022-onwards2", 
"https://www.espn.com/f1/story/_/id/31281017/f1-hold-miami-grand-prix-2022-onwards3", 
"https://www.espn.com/f1/story/_/id/31275927/how-hamilton-mercedes-turned-tables-red-bull1", 
"https://www.espn.com/f1/story/_/id/31275927/how-hamilton-mercedes-turned-tables-red-bull2", 
"https://www.espn.com/f1/story/_/id/31275927/how-hamilton-mercedes-turned-tables-red-bull3", 
"http://www.espn.com/espn/wire?section=rpm&id=312742341", "http://www.espn.com/espn/wire?section=rpm&id=312742342", 
"http://www.espn.com/espn/wire?section=rpm&id=312742343", "https://www.espn.com/f1/story/_/id/31270294/has-mercedes-regained-advantage-already-ferrari-spring-surprise1", 
"https://www.espn.com/f1/story/_/id/31270294/has-mercedes-regained-advantage-already-ferrari-spring-surprise2", 
"https://www.espn.com/f1/story/_/id/31270374/aston-martin-wants-aero-rules-revision1", 
"https://www.espn.com/f1/story/_/id/31270374/aston-martin-wants-aero-rules-revision2", 
"https://www.espn.com/f1/story/_/id/31270374/aston-martin-wants-aero-rules-revision3", 
"http://www.espn.com/espn/wire?section=rpm&id=312676841", "http://www.espn.com/espn/wire?section=rpm&id=312676842", 
"http://www.espn.com/espn/wire?section=rpm&id=312676843", "https://www.espn.com/f1/story/_/id/31263611/hamilton-says-f1-rivalry-vettel-remains-favourite1", 
"https://www.espn.com/f1/story/_/id/31263611/hamilton-says-f1-rivalry-vettel-remains-favourite2", 
"https://www.espn.com/f1/story/_/id/31263611/hamilton-says-f1-rivalry-vettel-remains-favourite3", 
"http://www.espn.com/espn/wire?section=rpm&id=312634771", "http://www.espn.com/espn/wire?section=rpm&id=312634772", 
"http://www.espn.com/espn/wire?section=rpm&id=312634773", "https://www.espn.com/f1/story/_/id/31263349/mercedes-gone-hunted-hunters1", 
"https://www.espn.com/f1/story/_/id/31263349/mercedes-gone-hunted-hunters2", 
"https://www.espn.com/f1/story/_/id/31263349/mercedes-gone-hunted-hunters3", 
"https://www.espn.com/f1/story/_/id/31263278/verstappen-calls-clarity-messy-track-limits-rules1", 
"https://www.espn.com/f1/story/_/id/31263278/verstappen-calls-clarity-messy-track-limits-rules2", 
"https://www.latimes.com/sports/story/2021-04-18/max-verstappen-lewis-hamilton-emilia-romagna-grand-prix1", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html1", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html2", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html4", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html6", 
"https://edition.cnn.com/2021/04/18/motorsport/max-verstappen-lewis-hamilton-imola-gp-spt-intl/index.html8"
), class = "data.frame")

The images are in a different location. You can try the following code:

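library(rvest)

# For each landing page, download the first image listed in a
# <picture><source data-srcset="..."> element, if the page has one.
lapply(seq_along(URLs), function(i) {
  URLs[i] %>%
    read_html() %>%
    html_nodes("picture source") %>%
    html_attr("data-srcset") %>%
    na.omit() %>%
    strsplit(",") %>%
    unlist() %>%      # also copes with pages that have no matching nodes
    trimws() %>%
    .[1] -> img
  # mode = "wb" keeps binary files intact on Windows; the index gives a file
  # name that is valid on every OS (Sys.time() would put colons in the name)
  if (!is.na(img)) download.file(img, paste0("photo", i, ".jpeg"), mode = "wb")
})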
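
For the download loop in the edit, one plausible reason that most images are missing is that `tryCatch()` wraps the whole `lapply()`, so the first failed download ends the loop. Protocol-relative links such as `//cdn.cnn.com/...` are also likely to fail because they carry no scheme. The sketch below handles errors per image instead. The helper names `normalise_url` and `download_all`, the `downloader` argument, and the fake downloader are assumptions made so the example runs without network access.

```r
# Protocol-relative links ("//host/path") have no scheme; prefix "https:".
normalise_url <- function(u) ifelse(startsWith(u, "//"), paste0("https:", u), u)

# Attempt every download separately so one failure does not stop the rest.
# `downloader` defaults to download.file() but can be swapped for testing.
download_all <- function(img_urls, dest_files, downloader = download.file) {
  ok <- vapply(seq_along(img_urls), function(i) {
    tryCatch({
      status <- downloader(normalise_url(img_urls[i]), dest_files[i],
                           mode = "wb", quiet = TRUE)
      isTRUE(status == 0)   # download.file() returns 0 on success
    }, error = function(e) FALSE)
  }, logical(1))
  data.frame(url = img_urls, file = dest_files, ok = ok,
             stringsAsFactors = FALSE)
}

# Offline demonstration with a stand-in downloader that fails on one URL.
fake <- function(url, destfile, ...) {
  if (grepl("missing", url)) stop("HTTP 404")
  0L
}
res <- download_all(
  c("//cdn.example.com/a.jpg", "https://example.com/missing.jpg",
    "https://example.com/b.jpg"),
  c("LP1-IMG1.jpg", "LP1-IMG2.jpg", "LP2-IMG1.jpg"),
  downloader = fake
)
res$ok
# [1]  TRUE FALSE  TRUE
```

With the real data this might be called as `download_all(total_urls.2$V1, paste0(total_urls.2$Article_URL, "_", total_urls.2$img_URL, ".jpeg"))`. The returned data frame shows which rows failed, so they can be inspected or retried.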