How to scrape data with filters from the website when the URL doesn't change?

Question

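I scraped data from this list in R, but the result does not reflect the filters I applied on the website (List = Oxford 3000 and CEFR level = A1). As far as I can tell, the scraped data contains no variable that I could use to filter it in R.

Is there another way to get the data I want? The URL does not seem to change when I apply the filters.

Here is my code: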

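library(rvest)  # read_html(), html_nodes(), html_text()
library(purrr)  # map()

url <- "https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000"

# Read the page and collect the text of every matching node into one character vector
url %>%
  map(. %>%
    read_html() %>%
    html_nodes(".belong-to , .pos , a") %>%
    html_text()
  ) %>%
  unlist() -> ox3ka1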

Answer 1

To select only the words that match the filter value a1, we can do the following:

library(rvest)

df = 'https://www.oxfordlearnersdictionaries.com/wordlists/oxford3000-5000' %>%
  read_html() %>%
  html_nodes('.top-g') %>%
  # keep only the <li> entries whose data-ox5000 attribute equals 'a1'
  html_nodes("li[data-ox5000 = 'a1']") %>%
  html_text()

head(df)
[1] "   a   indefinite articlea1      " "   about   adverba1      "         "   about   prepositiona1      "    "   above   adverba1      "        
[5] "   above   prepositiona1      "    "   across   adverba1      "
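The selector works because the page appears to ship every entry in the HTML and filter on the client side, which is also why the URL never changes. If you would rather have a tidy table than padded strings, the same idea can be sketched as below. This is only an illustration under assumptions: the inline HTML is a made-up stand-in for the real page, the `data-ox3000` attribute is assumed by analogy with `data-ox5000` (check the page source before relying on it), and the "example" row is invented sample data.

```r
library(rvest)

# Made-up stand-in for the word list page, so the sketch runs offline.
# The structure (ul.top-g > li with data-* attributes, an <a> and a .pos span)
# is an assumption modelled on the selectors used in this thread.
html <- '
<ul class="top-g">
  <li data-ox3000="a1" data-ox5000="a1"><a>about</a> <span class="pos">adverb</span></li>
  <li data-ox3000="a1" data-ox5000="a1"><a>above</a> <span class="pos">preposition</span></li>
  <li data-ox3000="b1" data-ox5000="b1"><a>example</a> <span class="pos">noun</span></li>
</ul>'

page <- read_html(html)

# Keep only the entries whose (assumed) Oxford 3000 level is A1.
items <- page %>%
  html_nodes(".top-g") %>%
  html_nodes("li[data-ox3000='a1']")

# Pull the word, part of speech and level out of each kept entry.
words <- data.frame(
  word  = items %>% html_node("a") %>% html_text(trim = TRUE),
  pos   = items %>% html_node(".pos") %>% html_text(trim = TRUE),
  level = html_attr(items, "data-ox3000"),
  stringsAsFactors = FALSE
)

print(words)
```

To try it on the live site, replace `html` with the URL from the question. Whether `data-ox3000` or `data-ox5000` is the right attribute for the Oxford 3000 filter depends on the page markup, so inspect a few `li` elements first.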


r

web-scraping

rvest