How to scrape data from a Wikipedia list into a pandas DataFrame
I am trying to scrape a list, not a table, from a Wikipedia page. It fails with "list index out of range". How can I fix this?
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://it.m.wikipedia.org/wiki/Premio_Bagutta'
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")
df = pd.DataFrame([[item.get_text().split(" ")[0],
                    item.find_next("a").get("title"),
                    item.find_next("i").get_text()[1:-1]]
                   for item in raw if item.find_next("i")],
                  columns=("Year"))
print(df.head())
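On the error itself: the only list indexing in this code that can raise is the `[0]` applied to the result of `find_all`, which returns an empty list when the fetched HTML contains no `div` with class `div-col`. Presumably the mobile page does not use that class (an assumption, not verified here). A minimal sketch of the failure and of a guard, run against a made-up inline HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: years in <p>, winners in <ul>,
# and no <div class="div-col"> anywhere.
html = """
<section>
  <p>1927</p>
  <ul><li>Giovanni Battista Angioletti, <i>Il giorno del giudizio</i> (Ribet)</li></ul>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns an empty list when nothing matches, so [0] raises IndexError.
blocks = soup.find_all("div", {"class": "div-col"})
try:
    blocks[0]
    error = None
except IndexError as exc:
    error = str(exc)

# Guarded version: only index when there is a match, otherwise fall back
# to searching the whole document for list items.
items = blocks[0].find_all("li") if blocks else soup.find_all("li")

print(error)
print(len(items))
```

Separately, `("Year")` is just the string "Year", not a tuple, and each row holds three values, so `columns` would need three names once the selector works.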
You can try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = requests.get("https://it.m.wikipedia.org/wiki/Premio_Bagutta")
# Take the first section that carries the list of winners
raw = BeautifulSoup(data.content, "html.parser").find_all(
    "section", class_="mf-section-2 collapsible-block"
)[0]
# Years are in <p> tags, the matching winners in the <ul> that follows each one
raw_years = [item.text.replace("\n", "") for item in raw.find_all("p")]
raw_authors = raw.find_all("ul")
# For some years, there are several authors, so you have to iterate in sync
years = []
authors = []
for year, author in zip(raw_years, raw_authors):
    years.append(year)
    authors.append(author.text.split("\n"))
# explode() turns each list of winners into one row per winner
df = pd.DataFrame({"year": years, "author": authors}).explode("author")
print(df)
# Output
year author
0 1927 Giovanni Battista Angioletti, Il giorno del giudizio[11][12] (Ribet)
1 1928 Giovanni Comisso, Gente di mare[13][14] (Treves)
2 1929 Vincenzo Cardarelli, Il sole a picco[15][16] (Mondadori)
3 1930 Gino Rocca, Gli ultimi furono i primi[17][18] (Treves)
4 1931 Giovanni Titta Rosa, Il varco nel muro[19][20] (Carabba)
.. ... ...
82 2018 Helena Janeczek, La ragazza con la Leica[154] (Guanda)
83 2019 Marco Balzano, Resto qui[9][155] (Einaudi)
84 2020 Enrico Deaglio, La bomba[8][156] (Feltrinelli)
85 2021 Giorgio Fontana, Prima di noi[157] (Sellerio)
86 2022 Benedetta Craveri, La contessa[158][159] (Adelphi)
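The rows above still carry footnote markers such as [11][12], and author, title and publisher share one column. A possible follow-up, sketched on a made-up inline HTML sample rather than the live page: pair each year with the list that immediately follows it, strip the markers, and split the text. The sample markup, the `entry` column and the "Author, Title (Publisher)" pattern are assumptions for illustration; entries that do not fit the pattern would come out as NaN.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the assumed structure: <p>year</p> then <ul>.
html = """
<section>
  <p>1927</p>
  <ul><li>Giovanni Battista Angioletti, <i>Il giorno del giudizio</i><sup>[11]</sup><sup>[12]</sup> (Ribet)</li></ul>
  <p>1928</p>
  <ul><li>Giovanni Comisso, <i>Gente di mare</i><sup>[13]</sup><sup>[14]</sup> (Treves)</li></ul>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for p in soup.find_all("p"):
    nxt = p.find_next_sibling(True)  # next tag after this <p>, skipping whitespace
    if nxt is None or nxt.name != "ul":
        continue  # this paragraph is not followed directly by a list
    for li in nxt.find_all("li"):
        # Drop footnote markers like [11] and tidy the surrounding whitespace.
        entry = re.sub(r"\[\d+\]", "", li.get_text()).strip()
        rows.append({"year": p.get_text(strip=True), "entry": entry})

df = pd.DataFrame(rows)

# Assumed entry format: "Author, Title (Publisher)".
pattern = r"^(?P<author>[^,]+),\s*(?P<title>.+?)\s*\((?P<publisher>[^()]*)\)$"
parts = df["entry"].str.extract(pattern)
df = pd.concat([df[["year"]], parts], axis=1)

print(df)
```

Pairing through the next sibling instead of `zip` keeps years and lists aligned even if the section contains an extra paragraph with no list after it.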