How to scrape data from a Wikipedia list into a pandas DataFrame

I am trying to scrape a list, not a table, from a Wikipedia page. It fails with "list index out of range". How can I fix this?

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://it.m.wikipedia.org/wiki/Premio_Bagutta'
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")

df = pd.DataFrame([[item.get_text().split(" ")[0],
                    item.find_next("a").get("title"),
                    item.find_next("i").get_text()[1:-1]]
                   for item in raw if item.find_next("i")],
                  columns=("Year"))
print(df.head())

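The "list index out of range" error comes from the [0]: find_all("div", {"class": "div-col"}) returns an empty list for this page, so there is no first element to index. You can try this instead, which selects the section that contains the list: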

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = requests.get("https://it.m.wikipedia.org/wiki/Premio_Bagutta")
raw = BeautifulSoup(data.content, "html.parser").find_all(
    "section", class_="mf-section-2 collapsible-block"
)[0]

# This code assumes each year sits in a <p> and its entries in a matching <ul>
raw_years = [item.text.replace("\n", "") for item in raw.find_all("p")]
raw_authors = raw.find_all("ul")

# Some years have several authors, so pair each year with its list of entries
years = []
authors = []
for (year, author) in zip(raw_years, raw_authors):
    years.append(year)
    authors.append(author.text.split("\n"))

df = pd.DataFrame({"year": years, "author": authors}).explode("author")

print(df)
# Output
    year                                                                author
0   1927  Giovanni Battista Angioletti, Il giorno del giudizio[11][12] (Ribet)
1   1928                      Giovanni Comisso, Gente di mare[13][14] (Treves)
2   1929              Vincenzo Cardarelli, Il sole a picco[15][16] (Mondadori)
3   1930                Gino Rocca, Gli ultimi furono i primi[17][18] (Treves)
4   1931              Giovanni Titta Rosa, Il varco nel muro[19][20] (Carabba)
..   ...                                                                   ...
82  2018                Helena Janeczek, La ragazza con la Leica[154] (Guanda)
83  2019                            Marco Balzano, Resto qui[9][155] (Einaudi)
84  2020                        Enrico Deaglio, La bomba[8][156] (Feltrinelli)
85  2021                         Giorgio Fontana, Prima di noi[157] (Sellerio)
86  2022                    Benedetta Craveri, La contessa[158][159] (Adelphi)
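
The author column above still carries footnote markers such as [11][12] and mixes author, title and publisher in one string. As a possible follow-up, here is a minimal sketch that strips the markers and splits the string into separate columns. It assumes every entry looks like "Author, Title (Publisher)", which may not hold for every row on the page; the names cleaned, pattern, parts and result are only illustrative, and the sample rows are copied from the output above so that it runs without a network request.

```python
import pandas as pd

# Sample rows copied from the output above, so the sketch runs offline.
df = pd.DataFrame(
    {
        "year": ["1927", "2018", "2019"],
        "author": [
            "Giovanni Battista Angioletti, Il giorno del giudizio[11][12] (Ribet)",
            "Helena Janeczek, La ragazza con la Leica[154] (Guanda)",
            "Marco Balzano, Resto qui[9][155] (Einaudi)",
        ],
    }
)

# Drop footnote markers such as "[11]" or "[154]".
cleaned = df["author"].str.replace(r"\[\d+\]", "", regex=True).str.strip()

# Assumed layout: "Author, Title (Publisher)".
# Rows that do not match this layout end up as NaN in all three columns.
pattern = r"^(?P<author>[^,]+),\s*(?P<title>.+?)\s*\((?P<publisher>[^()]+)\)$"
parts = cleaned.str.extract(pattern)

# Put the year next to the three extracted columns.
result = pd.concat([df[["year"]], parts], axis=1)
print(result)
```

If explode leaves repeated index labels in your DataFrame, calling reset_index(drop=True) before this step keeps the rows aligned. Entries whose author part itself contains a comma would be split at the wrong place, so it is worth checking the rows where the result is NaN or looks odd.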