Extract meta description from web pages using R
Hi, I'm trying to retrieve the meta descriptions of these web pages from the page source:
Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))
Desired output:
Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook",
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))
I tried to do this with httr, but I can't get the result in the desired output format, and I can't extract the content from what the GET command returned:
library (httr)
resp<-GET ("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
List of 10
$ url : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
$ status_code: int 200
$ headers :List of 22
..$ server : chr "Apache/2.2"
The field I need to extract from the source comes right after this string:
<meta itemprop="description" content="
like this:
<meta itemprop="description" content="'Spam King'
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
You really just need rvest. Since they're all <h1> headings, you can iterate over the list of URLs and pick out the headings:
library(rvest)
sapply(Data$Pages,
function(url){
url %>%
as.character() %>% # in case strings are stored as factors
read_html() %>%
html_nodes('h1') %>%
html_text()
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
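One caveat: if a page ever had more than one <h1>, or none at all, html_nodes() would return several strings (or zero) for that URL and sapply() would fall back to a list. Here is a minimal sketch of a stricter variant, run on inline HTML so it works offline. first_heading and the sample pages are hypothetical, made up for illustration:

```r
library(rvest)

# Hypothetical helper: return the text of the first <h1> in a parsed page,
# or NA if the page has no <h1>.
first_heading <- function(page) {
  page %>%
    html_node("h1") %>%   # first match only; a "missing" node if none
    html_text()
}

# Inline sample pages standing in for the downloaded ones (assumed markup).
pages <- c(
  "<html><body><h1>First title</h1><h1>Second title</h1></body></html>",
  "<html><body><p>No heading here</p></body></html>"
)

# vapply() enforces exactly one string per page.
titles <- vapply(pages, function(p) first_heading(read_html(p)),
                 character(1), USE.NAMES = FALSE)
titles
```

With real URLs you would pass as.character(Data$Pages) instead of pages; read_html() accepts either a URL or literal HTML.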
Or, if you really want to scrape the <meta> tag, you can do it the same way, although the selector is more of a pain:
sapply(Data$Pages, function(url){
url %>%
as.character() %>%
read_html() %>%
html_nodes(xpath = '//meta[@itemprop="description"]') %>%
html_attr('content')
})
Either way, you get the same result.
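If you would rather keep httr for the download, the response body can be handed to read_html() as text; with a real response that would be something like content(resp, as = "text", encoding = "UTF-8"). Below is a minimal sketch on an inline string so it runs offline. meta_description is a hypothetical helper, and the sample markup and its "Example description" value are assumptions, not the real page source. It uses a CSS attribute selector, which should match the same node as the XPath version:

```r
library(rvest)

# Hypothetical helper: pull the content of <meta itemprop="description">
# out of HTML source text; returns NA if the tag is absent.
meta_description <- function(html) {
  read_html(html) %>%
    html_node('meta[itemprop="description"]') %>%  # first match only
    html_attr("content")
}

# Assumed sample markup, shaped like the tag quoted in the question.
sample_html <- paste0(
  '<html><head>',
  '<meta itemprop="description" content="Example description">',
  '</head><body><h1>Example title</h1></body></html>'
)

desc <- meta_description(sample_html)
desc
```

To fill the column from the question, something along the lines of Data$Meta_Description <- vapply(as.character(Data$Pages), function(u) meta_description(content(GET(u), as = "text", encoding = "UTF-8")), character(1), USE.NAMES = FALSE) should work, though I have not run it against the live pages.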