R: readLines on a URL leads to missing lines

Question

When I use readLines() on a URL, I get missing lines or values. This might be due to spacing that the computer can't read.

When you open the URL above, Ctrl+F finds 38 instances of text matching "TV-". On the other hand, when I run readLines() and grep("TV-", HTML), I only find 12.

So, how can I avoid encoding/spacing errors so that I get the complete lines of the HTML?

Answer 1

You can use rvest to scrape the data. For example, to get all the titles, you can do the following:

library(rvest)

url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'
all_titles <- url %>%
  read_html() %>%                                  # download and parse the page
  html_nodes('div.lister-item-content h3 a') %>%   # select the title links
  html_text()                                      # extract the link text

all_titles

# [1] "The Haunting of Bly Manor"               "The Haunting of Hill House"             
# [3] "Supernatural"                            "Helstrom"                               
# [5] "The 100"                                 "Lucifer"                                
# [7] "Criminal Minds"                          "Fear the Walking Dead"                  
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"   
#...                 
#...
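
One possible contributor to the 38 vs. 12 discrepancy in the question, separate from any encoding issue, is that grep() reports which lines contain a match, while Ctrl+F counts every occurrence. If several "TV-" strings sit on the same line of HTML, grep() counts that line once. The sketch below uses a small made-up vector (a hypothetical stand-in for the output of readLines(), not the real page) to show the difference and how gregexpr() can count every occurrence:

```r
# Hypothetical stand-in for readLines() output: one string per line of HTML
html <- c(
  '<span class="certificate">TV-MA</span> <span class="certificate">TV-14</span>',
  '<p>no rating here</p>',
  '<span class="certificate">TV-PG</span>'
)

# grep() returns the indices of the lines that contain at least one match,
# so a line with several matches is only counted once
matching_lines <- grep("TV-", html, fixed = TRUE)
length(matching_lines)
# [1] 2

# gregexpr() returns the start position of every match within each line
# (-1 when a line has no match), so counting positive positions gives
# the total number of occurrences
matches <- gregexpr("TV-", html, fixed = TRUE)
total_matches <- sum(sapply(matches, function(m) sum(m > 0)))
total_matches
# [1] 3
```

Whether this accounts for the full difference on the real page is an assumption. The server may also return different markup to R than to a browser, which is another reason to parse the document with rvest instead of matching raw lines.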

r

html-parsing

web