Scraping a list of links using R

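I want to use R to scrape and extract the list of all dependent links. For example, consider the List of Cuisines on Wikipedia. The cuisines there are grouped by region, ethnicity, and so on; each group is itself a link and is further subdivided into more links, forming a hierarchy. I want to extract the whole hierarchy in R. Matching links with a generic regular expression returns every link on the page, but I would like a table listing all the dependencies, for example: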

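  1. List of cuisines:
    • List of Asian cuisines
    • List of European cuisines
      • List of Central European cuisines
        • Austrian cuisine
        • Bulgarian cuisine
        • Czech cuisine
        • German cuisine, etc.
    • List of Oceanian cuisines ...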

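I know how to scrape data from a single web page using R. I am still quite new to it and would like to know how to extract the dependencies between links.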

For example, you could do the following, if this is what you are looking for:

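library(rvest)
library(magrittr)
# Open a browsing session on the list page (rvest >= 1.0 names this function session()).
session <- html_session("https://en.wikipedia.org/wiki/List_of_cuisines")
# Link text of every <a> inside the <ul> that is the 13th child of its parent.
session %>% html_nodes("ul:nth-child(13) a") %>% html_text()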
 [1] "Ainu"               "Akan"               "Arab"               "Assyrian"           "Balochi"           
 [6] "Berber"             "Buddhist"           "Bulgarian"          "Cajun"              "Chinese Islamic"   
[11] "Circassian"         "Crimean Tatar"      "Inuit"              "Italian American"   "Jewish"            
[16] "Sephardic"          "Mizrahi"            "Bukharan"           "Syrian Jewish"      "Kurdish"           
[21] "Malayali Food"      "Louisiana Creole"   "Maharashtrian"      "Mordovian"          "Native American"   
[26] "Parsi"              "Pashtun"            "Pennsylvania Dutch" "Peranakan"          "Persian cuisine"   
[31] "Punjabi"            "Rajasthani"         "Romani"             "Sami"               "Sindhi"            
[36] "Tatar"              "Yamal"              "Zanzibari"          "South Indian"    
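
The link text alone drops the targets, so it can help to keep each name next to its full URL. Below is a minimal sketch of that idea using xml2 (the package rvest builds on). The `snippet` HTML and its two hrefs are made up for illustration and only stand in for part of the downloaded page; `url_absolute()` is used on the assumption that the scraped hrefs are relative, like "/wiki/...".

```r
library(xml2)

# Illustrative snippet standing in for part of the downloaded page (assumed markup).
snippet <- read_html('<ul>
  <li><a href="/wiki/Ainu_cuisine">Ainu</a></li>
  <li><a href="/wiki/Akan_cuisine">Akan</a></li>
</ul>')

# Select every link inside a list, then pair its text with its absolute URL.
anchors <- xml_find_all(snippet, "//ul//a")
links <- data.frame(
  name = xml_text(anchors),
  url  = url_absolute(xml_attr(anchors, "href"), "https://en.wikipedia.org/"),
  stringsAsFactors = FALSE
)
print(links)
```

A data frame like this is easier to extend later, for example with a column naming the page each link was found on.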

If you want to dig deeper and scrape all the links, you can continue as follows:

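# Collect the link targets, then visit each one from the same session.
cuisine_links <- session %>% html_nodes("ul:nth-child(13) a") %>% html_attr("href")
articles <- lapply(cuisine_links, jump_to, x = session)
# Text of the first <p> on each visited page.
# html() was deprecated in rvest in favor of read_html().
explanation <- lapply(articles, function(a) {
  a %>% read_html() %>% html_node("p") %>% html_text()
})

This gives you a list with the first Wikipedia explanation for each article (the one above the contents box):

> head(explanation)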
[[1]]
[1] "Ainu cuisine is the cuisine of the ethnic Ainu in Japan. The cuisine differs markedly from that of the majority Yamato people of Japan. Raw meat like sashimi, for example, is not served in Ainu cuisine, which instead uses methods such as boiling, roasting and curing to prepare meat. The island of Hokkaidō in northern Japan is where most Ainu live today; however, they once inhabited most of the Kuril islands, the southern half of Sakhalin island, and parts of northern Honshū Island."

[[2]]
[1] "Akan cuisine, the cuisine of the Akan people, includes meat and fish (seafood) grilled over hot coals, wide and varied range of soups, stews, several kinds of starch foods, groundnut, palm, patties (or empanadas), ground corn (maize), sadza, ugali."

[[3]]
[1] "Arab cuisine (Arabic: مطبخ عربي‎) is defined as the various regional cuisines spanning the Arab world, from Mesopotamia to North-Africa. Arab cuisine often incorporates the Levantine and Egyptian culinary traditions."

[[4]]
[1] "The cuisine of the indigenous Assyrian people from northern Iraq, north eastern Syria, north western Iran and south eastern Turkey is similar to other Middle Eastern cuisines. It is rich in grains, meat, tomato, and potato. Rice is usually served with every meal accompanied by a stew which is typically poured over the rice. Tea is typically consumed at all times of the day with or without meals, alone or as a social drink. Cheese, crackers, biscuits, baklawa, or other snacks are often served alongside the tea as appetizers. Dietary restrictions may apply during Lent in which certain types of foods may not be consumed; often meaning animal-derived. Alcohol is rather popular specifically in the form of Arak and Wheat Beer. Unlike in Jewish cuisine and Islamic cuisines in the region, pork is allowed, but it is not widely consumed because of restrictions upon availability imposed by the Muslim majority."

[[5]]
[1] "Balochi cuisine refers to the food and cuisine of the Baloch people from the Balochistan region, comprising the Pakistani Balochistan province as well as Sistan and Baluchestan in Iran and Balochistan, Afghanistan. Baloch food has a regional variance in contrast to many other cuisines of Pakistan[1][2][3][4] and Iran."

[[6]]
[1] "The Amazigh (Berber) cuisine is considered as a traditional cuisine which evolved little in the course of time."
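
As for the dependency table asked about in the question, one possible approach is to read the nesting of the lists itself. This is only a sketch, and it assumes the hierarchy is expressed as nested <ul>/<li> elements, which may not hold for every section of the real page. `link_hierarchy()` is a hypothetical helper, and the inline HTML with its hrefs is illustrative, not copied from Wikipedia. For each list item it records the link, how deeply the item is nested, and the link of the nearest enclosing list item.

```r
library(xml2)

# Hypothetical helper: one row per <li> that directly contains a link.
link_hierarchy <- function(doc) {
  items <- xml_find_all(doc, "//li[a]")
  own_link <- xml_find_first(items, "./a")
  data.frame(
    # Nesting depth = number of <ul> ancestors of the item.
    depth  = vapply(items, function(li) length(xml_find_all(li, "ancestor::ul")), integer(1)),
    name   = xml_text(own_link),
    href   = xml_attr(own_link, "href"),
    # Link text of the nearest enclosing <li>; NA for top-level items.
    parent = xml_text(xml_find_first(items, "ancestor::li[1]/a")),
    stringsAsFactors = FALSE
  )
}

# Illustrative nested list (assumed markup).
doc <- read_html('<ul>
  <li><a href="/wiki/List_of_Asian_cuisines">List of Asian cuisines</a></li>
  <li><a href="/wiki/List_of_European_cuisines">List of European cuisines</a>
    <ul>
      <li><a href="/wiki/Central_European_cuisine">Central European cuisine</a>
        <ul>
          <li><a href="/wiki/Austrian_cuisine">Austrian cuisine</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>')

hierarchy <- link_hierarchy(doc)
print(hierarchy)
```

When the levels live on separate pages rather than in one nested list, the same parent column can be filled in by hand: follow each link with jump_to(), scrape the links on the page you land on, and record the page you came from as their parent.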