R software - rvest package: wrong number of review counts downloaded
I want to download the review counts for books on Amazon, but I have run into a problem.
I tried the following:
library(rvest)
url<-paste0("http://www.amazon.com/s/ref=lp_4_nr_p_72_3?",
"fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2C",
"n%3A4%2Cp_72%3A1250224011&bbn=4&ie=UTF8&qid",
"=1440446201&rnid=1250219011")
html <- read_html(url)  # html() was deprecated in favor of read_html()
Reviews <- try({html_nodes(html, "#s-results-list-atf .a-text-normal:nth-child(2)") %>%
html_text()}, silent = TRUE)
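But my R console shows only 4 review counts instead of the 12 I get with SelectorGadget. What am I doing wrong?
I don't have the same problem when I download the book titles, only with the review counts.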
Book <- try({ html_nodes(html, ".s-access-title") %>%
html_text()}, silent = TRUE)
Link to the page: Amazon Page
This may not be the canonical way to do it, but here is what worked for me:
#via Inspect element in Chrome, the relevant info is
# in an <a> tag with class 'a-size-small a-link-normal a-text-normal'
# but this does not uniquely identify the review counts
# (e.g., the .00 Buy used & new... bit is also there)
# so we take a step up and find that both the rating
# and the review count are stored in a <div> tag
# with class 'a-row a-spacing-mini'
x <- read_html(url) %>%  # read_html() replaces the deprecated html()
  html_nodes("div.a-row.a-spacing-mini") %>%
  html_nodes("a.a-size-small.a-link-normal.a-text-normal") %>%
  html_text()
#upon inspection of x, we can see that the relevant numbers
# always appear by themselves, thus:
> x[!is.na(as.integer(gsub(",","",x)))]
[1] "168" "232" "1,607" "2,226" "1,060" "25" "731" "2,374" "345" "7,205"
[11] "1,134" "1,137"
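The filtering step above relies on as.integer() turning non-numeric strings into NA, which also emits coercion warnings. A minimal sketch of an alternative, assuming the scraped vector looks roughly like the hypothetical sample below (the non-numeric strings are made up for illustration and are not taken from the page), is to match the count pattern explicitly and then convert:

```r
# Hypothetical sample of what html_text() might return: review counts mixed
# with other link text. The non-numeric strings are invented for illustration.
x <- c("168", "See newer edition", "1,607", "Other formats", "25", "7,205")

# Keep only strings made up entirely of digits and thousands separators.
is_count <- grepl("^[0-9][0-9,]*$", x)

# Strip the commas and convert the surviving strings to integers.
review_counts <- as.integer(gsub(",", "", x[is_count], fixed = TRUE))

print(review_counts)
```

This uses only base R, so it can be tried without rvest or a network connection. Whether the pattern is strict enough depends on what else the selector picks up on the real page.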