Extracting title from link in R

Question

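I am practicing web scraping with the rvest package in R. So far, this page has been a great guide: http://zevross.com/blog/2015/05/19/scrape-website-data-with-the-new-r-package-rvest/. Using the Selector Gadget tool, I can identify the class or div element that refers to the items I want (as far as I understand it).

So I went to Wikipedia and tried to extract the list of U.S. presidents. The link to the page is https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States. Selector Gadget tells me the element class/div/????? (not sure what to call it) is "big a".

Here is my code so far: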

library(rvest)
site = read_html("https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States")
fnames = html_nodes(site, "big a")

Part of the output is:

{xml_nodeset (44)}
 [1] <a href="/wiki/George_Washington" title="George Washington">George Washington</a>
 [2] <a href="/wiki/John_Adams" title="John Adams">John Adams</a>
 [3] <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
 [4] <a href="/wiki/James_Madison" title="James Madison">James Madison</a>
 [5] <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
 [6] <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>
 [7] <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>
 [8] <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>

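Awesome! So I extracted the names along with the links! I only want the names, though, so I'm not sure how to proceed from here. Is there an easy way to grab the name between the link's HTML tags? Or should I use the html_nodes function to get a different element? I feel like I'm close!

Thanks for your help.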

Answer 1

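There are two sources for the name: the title attribute and the link text. They may be formatted slightly differently, or one of them may include a middle initial or something else. Use whichever one you like best.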

html_attr(fnames, "title")

or

html_text(fnames)
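
To see the two calls side by side without hitting the network, here is a minimal sketch. The HTML fragment is hand-written for illustration and is only an assumption about the page's structure (the live Wikipedia markup may differ), and the variable names `snippet`, `name_from_text`, `name_from_title`, and `link_paths` are made up for this example.

```r
library(rvest)

# Hand-written stand-in for the Wikipedia table; not the real page markup.
snippet <- '<table>
  <tr><td><big><a href="/wiki/George_Washington" title="George Washington">George Washington</a></big></td></tr>
  <tr><td><big><a href="/wiki/John_Adams" title="John Adams">John Adams</a></big></td></tr>
</table>'

page   <- read_html(snippet)
fnames <- html_nodes(page, "big a")  # every <a> nested inside a <big>

name_from_text  <- html_text(fnames)           # text between <a> and </a>
name_from_title <- html_attr(fnames, "title")  # value of the title attribute
link_paths      <- html_attr(fnames, "href")   # relative link targets

print(name_from_text)
print(name_from_title)
print(link_paths)
```

Both `html_text()` and `html_attr()` return plain character vectors with one element per matched node, so either result can go straight into a data frame column. Incidentally, "big a" is a CSS descendant selector: it matches `a` elements that sit anywhere inside a `big` element.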
