Scraping <li> items using rvest
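I want to scrape https://www.deutsche-biographie.de/. Specifically, I am interested in scraping the following information about each person:
- name
- year of birth
- year of death
- profession
- place of birth ('geburt' in the source code) and its coordinates
- place of death ('tod' in the source code) and its coordinates
- places of activity ('wirk' in the source code) and their coordinates
With the code below, I scraped the name, year of birth, year of death, and profession.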
library(rvest)
library(dplyr)
page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("#secondColumn p") %>% html_text()
result = data.frame(name, information, stringsAsFactors = FALSE)
#manipulate data in columns
result$yearofbirth = sub("(^[^-]+)-.*", "\\1", result$information) #extract characters before dash
result$yearofdeath = sub(',.*$','', result$information)
result$yearofdeath = sub('.*-','', result$yearofdeath) #extract characters after dash
result$profession = sub("^.*?,", "", result$information) #extract characters after comma
result$profession = trimws(result$profession, whitespace = "[ \t\r\n]") #trim leading and trailing white space
result$information = NULL
However, I am struggling to extract the places from the data-orte attribute of list items like this one:
<li class="media treffer-liste-elem" id="treffer-sfz55763" data-orte="Rendsburg@54.3012661,9.6596678@geburt;Rendsburg@54.3012661,9.6596678@wirk;Kiel@54.3216753,10.1371858@wirk;Magdeburg@52.1315889,11.6399609@wirk;Rostock@54.14736345,12.109015599915@wirk;Frankfurt/Oder@52.3438922,14.5544166@wirk;Gottorf@54.5117924,9.54054973309832@wirk;Padua@45.407059,11.8767269@wirk;Bologna@44.4936714,11.3430347@wirk;Basel@47.5429886,7.5969912@wirk;Königsberg@54.7066424,20.5105165@wirk;Danzig@54.3482114,18.6542829@wirk;Prag@50.087656,14.4212126@wirk;Amsterdam@52.3710089,4.9001115@wirk;Frankfurt@50.1432793,8.6805975@wirk;Rostock@54.14736345,12.109015599915@wirk;Magdeburg@52.1315889,11.6399609@tod" data-name="Maier, Michael">
I would appreciate any hints on how to scrape these places!
Best, Natalie
Hope this solution helps:
page %>%
  html_elements("#secondColumn > ul") %>%
  html_children() %>%
  html_attr("data-orte") %>%
  stringr::str_split(";") # str_split() comes from stringr, which was not loaded above
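To illustrate what each split element contains, here is a minimal base R sketch that turns one data-orte value into a small data frame. The helper name parse_orte is hypothetical, the input string is shortened from the example in the question, and the code assumes every entry has the form name@lat,lon@type with latitude first.

```r
# One data-orte value, shortened from the example in the question
orte <- "Rendsburg@54.3012661,9.6596678@geburt;Kiel@54.3216753,10.1371858@wirk;Magdeburg@52.1315889,11.6399609@tod"

# Hypothetical helper: split one attribute value into place, lat, lon, type
parse_orte <- function(x) {
  entries <- strsplit(x, ";", fixed = TRUE)[[1]]   # one element per place
  fields  <- strsplit(entries, "@", fixed = TRUE)  # name, "lat,lon", type
  coords  <- strsplit(vapply(fields, `[`, character(1), 2), ",", fixed = TRUE)
  data.frame(
    place = vapply(fields, `[`, character(1), 1),
    lat   = as.numeric(vapply(coords, `[`, character(1), 1)), # assumed to be latitude
    lon   = as.numeric(vapply(coords, `[`, character(1), 2)), # assumed to be longitude
    type  = vapply(fields, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

places_df <- parse_orte(orte)
places_df
```

Applying the helper to every element of the scraped attribute vector (for example with lapply) would give one such data frame per person.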
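Another option to achieve your desired result could look like this:
- The first step is similar to the solution proposed by @Kafe: get the place information from the data-orte attribute and split it by ; to get a list of places.
- As a second step, I use lapply to put the places of birth, death, and activity into separate columns of the result data frame.
- In the third step I make heavy use of tidyr::extract, which makes it easy to extract multiple pieces of information from a string and put them into separate columns in one go.
Note: I also used a different approach to extract the years of birth and death.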
library(rvest)
library(dplyr)
page = read_html(x = "https://www.deutsche-biographie.de/search?_csrf=45b6ee54-385e-4777-90bf-9067923e6a00&name=meier")
name = page %>% html_nodes(".media-heading a") %>% html_text()
information = page %>% html_nodes("div.media-body p") %>% html_text()
result = data.frame(name, information)
result$information <- result$information %>% trimws() %>% strsplit(split = ", \n") %>% lapply(trimws)
result <- tidyr::unnest_wider(result, information) %>%
  rename(years = 2, profession = 3) %>%
  tidyr::extract(years, into = c("year_of_birth", "year_of_death"), regex = "^.*?(\\d{4}).*?\\-\\s(\\d{4})")
places <- page %>% html_nodes("li.treffer-liste-elem") %>% html_attr("data-orte") %>% strsplit(";")
result$place_of_birth <- lapply(places, function(x) x[grepl("@geburt$", x)]) %>% unlist()
result$place_of_death <- lapply(places, function(x) x[grepl("@tod$", x)]) %>% unlist()
result$place_of_activity <- lapply(places, function(x) x[grepl("@wirk$", x)])
result <- result %>%
  tidyr::extract(place_of_birth, into = c("place_of_birth", "place_of_birth_coord"), regex = "^(.*?)@(.*?)@.*$") %>%
  tidyr::extract(place_of_death, into = c("place_of_death", "place_of_death_coord"), regex = "^(.*?)@(.*?)@.*$")
result
#> # A tibble: 10 × 9
#> name year_of_birth year_of_death profession place_of_birth place_of_birth_…
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Meier… 1718 1777 Philosoph Ammendorf bei… 51.4265204,11.9…
#> 2 Meyer… 1772 1849 Jurist; B… Frankfurt/Main 50.1432793,8.68…
#> 3 Meier… 1809 1898 Bremer Ka… Bremen 53.0758099,8.80…
#> 4 Major… 1502 1574 lutherisc… Nürnberg 49.4538501,11.0…
#> 5 Meyer… 1810 1874 schweizer… Sursee Kanton… 47.1774826,8.10…
#> 6 Maier… 1568 1622 Alchemist… Rendsburg 54.3012661,9.65…
#> 7 Meier… 1692 1745 Jurist; A… Bayreuth 49.9427202,11.5…
#> 8 Mejer… 1818 1893 Jurist; P… Zellerfeld (H… 51.804126,10.33…
#> 9 Meyer… 1474 1548 Bürgermei… Basel 47.5429886,7.59…
#> 10 Hirsc… 1770 1851 Mathemati… Friesack (Mit… 52.7395263,12.5…
#> # … with 3 more variables: place_of_death <chr>, place_of_death_coord <chr>,
#> # place_of_activity <list>
Created on 2021-11-21 by the reprex package (v2.0.1)
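One caveat: calling unlist() on the lapply() results only lines up with the rows of result if every person has exactly one 'geburt' and one 'tod' entry. The following is a minimal sketch of a more defensive variant, not part of the answer above. The helper pick_place and the sample list are hypothetical and stand in for the places object built from the page.

```r
# Hypothetical split data-orte values for three people: the second has no
# birth place, the third has no entries at all
places <- list(
  c("Rendsburg@54.3012661,9.6596678@geburt", "Magdeburg@52.1315889,11.6399609@tod"),
  c("Basel@47.5429886,7.5969912@tod"),
  character(0)
)

# Return the first entry of the given type, or NA when there is none
pick_place <- function(x, type) {
  hit <- x[grepl(paste0("@", type, "$"), x)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# vapply() guarantees exactly one value per person, so the vectors
# always have the same length as the list of people
birth <- vapply(places, pick_place, character(1), type = "geburt")
death <- vapply(places, pick_place, character(1), type = "tod")

birth
death
```

Because missing entries become NA instead of being dropped, the resulting vectors can be assigned to data frame columns without a length mismatch, and tidyr::extract can then be applied as before.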