Using rvest to scrape match scores from Cricbuzz in R
I am scraping the Cricbuzz live scores page to get match details, and I am using SelectorGadget to find the CSS selectors. This is what I have so far:
crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
matches_dates <- crickbuzz %>%
html_nodes(".schedule-date:nth-child(1)") %>%
html_text()
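I have managed to get the matches, scores, and venues, but I am having trouble getting the dates.
The code above gives me the following result: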
> matches_dates
" - " " - " " " " " " " " " " "
" " " " " " " - " " - " " - "
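So I get 21 elements, which means there are currently 21 matches, but none of them contains the date text.
I then looked at what html_nodes() returns on its own.
It gives: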
{xml_nodeset (21)}
1 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">
</span>
2 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">
</span>
3 <span class="schedule-date" timestamp="1452132000000" format="MMM dd'">
</span> and so on....
This means I am not getting any text from the tags.
How can I get the dates?
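The <span> elements contain no date text; the date is stored in their timestamp attribute, so you need to extract it with html_attr() instead of html_text():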
library(rvest)
crickbuzz <- read_html(httr::GET("http://www.cricbuzz.com/cricket-match/live-scores"))
matches_dates <- crickbuzz %>%
html_nodes(".schedule-date:nth-child(1)")%>%
html_attr("timestamp")
matches_dates
[1] "1452268800000" "1452132000000" "1452247200000" "1452242400000" "1452327000000" "1452290400000" "1452310200000" "1452310200000" "1452310200000"
[10] "1452310200000" "1452324600000" "1452324600000" "1452324600000" "1452324600000" "1452324600000" "1452150000000" "1452153600000" "1452153600000"
# These values are Unix time (in milliseconds), so if you need to convert them to date-time format,
# follow the answer to this question:
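As a rough sketch of that conversion (not necessarily the approach in the linked answer): the 13-digit values look like milliseconds since 1970-01-01, so one option is to convert the strings to numbers, divide by 1000, and pass the result to base R's as.POSIXct(). The sample vector below is copied from the output above. The UTC time zone and the name match_times are assumptions made for illustration; the site may display dates in the visitor's local time zone.

```r
# First three timestamps copied from the output above
# (assumed to be milliseconds since 1970-01-01 00:00:00 UTC)
matches_dates <- c("1452268800000", "1452132000000", "1452247200000")

# html_attr() returns character strings, so convert to numeric first,
# then divide by 1000 because as.POSIXct() expects seconds, not milliseconds
match_times <- as.POSIXct(as.numeric(matches_dates) / 1000,
                          origin = "1970-01-01", tz = "UTC")  # tz is an assumption

# Render the date-times as text
format(match_times, "%Y-%m-%d %H:%M")
# [1] "2016-01-08 16:00" "2016-01-07 02:00" "2016-01-08 10:00"
```

To get something closer to the page's own "MMM dd" pattern, format(match_times, "%b %d") should work, although the month abbreviation depends on the R session's locale.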