R: readLines on a URL leads to missing lines

When I call readLines() on a URL, I get missing lines or values. This might be due to spacing or encoding that R can't read.

When I open the URL in a browser, Ctrl+F finds 38 instances of text matching "TV-". When I run readLines() followed by grep("TV-", HTML), however, I find only 12.

So how can I avoid the encoding/spacing errors so that I get the complete lines of the HTML?

You can use rvest to scrape the data instead. For example, to get all the titles, you can do the following:

library(rvest)

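# Search results page for titles filmed in Vancouver, British Columbia
url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'

# Parse the page, select each result's title link, and extract its text
all_titles <- url %>%
  read_html() %>%
  html_nodes('div.lister-item-content h3 a') %>%
  html_text()

all_titles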

# [1] "The Haunting of Bly Manor"               "The Haunting of Hill House"             
# [3] "Supernatural"                            "Helstrom"                               
# [5] "The 100"                                 "Lucifer"                                
# [7] "Criminal Minds"                          "Fear the Walking Dead"                  
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"
#...
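
A side note on the 12 versus 38 discrepancy. One possible contributor (an assumption, not something verified against this page) is that grep() returns the indices of matching lines, so a line holding several "TV-" strings counts only once, whereas Ctrl+F counts every occurrence. The sketch below uses a made-up html vector as a stand-in for the readLines() result and compares the two counts using base R only:

```r
# Hypothetical stand-in for readLines(url): three lines of HTML,
# one of which holds several "TV-" ratings on a single line.
html <- c(
  '<span>TV-MA</span><span>TV-14</span><span>TV-PG</span>',
  '<p>no rating on this line</p>',
  '<span>TV-14</span>'
)

# grep() returns the indices of matching lines, so each line counts once.
matching_lines <- grep("TV-", html, fixed = TRUE)
n_lines <- length(matching_lines)

# gregexpr() finds every match position within each line;
# regmatches() extracts the matched text, and lengths() counts per line.
matches <- regmatches(html, gregexpr("TV-", html, fixed = TRUE))
n_occurrences <- sum(lengths(matches))

n_lines        # number of lines containing "TV-"
n_occurrences  # number of "TV-" occurrences across all lines
```

If the occurrence count on the real page still falls short of what the browser shows, the HTML returned to R probably differs from what the browser renders, and parsing it with rvest as above is the more reliable route.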