Scraping Wikipedia tables with Python selectively
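I'm having trouble scraping a wiki table and hope someone who has done this before can give me some advice.
From List_of_current_heads_of_state_and_government I need the countries (which works with the code below), and then only the first-mentioned head of state plus their name. I'm not sure how to isolate the first mention, since all of them sit in a single cell. My attempt to extract the names gave me this error: IndexError: list index out of range. Thanks for your help!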
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, 'lxml')
my_table = soup.find('table', {'class': 'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []

for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]
    states.append(state_cell.text)
print(states)

for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles)

for row in my_table.find_all('td'):
    name_cell = row.find_all('a')[1]
    names.append(name_cell.text)
print(names)
The ideal output would be a pandas DataFrame:
State | Title | Name |
It isn't perfect, but something like this almost works.
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, 'lxml')
my_table = soup.find('table', {'class': 'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []

""" for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]
    states.append(state_cell.text)
print(states)

for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles) """

for row in my_table.find_all('td'):
    try:
        names.append(row.find_all('a')[1].text)
    except IndexError:
        names.append(row.find_all('a')[0].text)
print(names)
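So far I have spotted only one error in the resulting list of names. The table is a bit tricky because you have to handle exceptions. For example, some names are not links, in which case the code just captures the first link it finds in that cell. You only need to add a few more if clauses for cases like that. At least that's how I would do it.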
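The "name is not a link" case can also be handled without relying on link positions at all. The following is only a sketch under stated assumptions, not a tested scraper for the live page: it assumes each office holder in a cell is written as "Title – Name" with an en dash, and that further holders in the same cell come after a `<br>`. The helper names `first_mention` and `parse_rows` and the sample HTML (with made-up countries and people) are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample that imitates the assumed cell layout; not real page data.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th>State</th><th>Head of state</th></tr>
  <tr><th scope="row"><a href="/wiki/F">Freedonia</a></th>
      <td><a href="/wiki/P">President</a> – <a href="/wiki/J">Jane Doe</a><br>
          <a href="/wiki/V">Vice President</a> – <a href="/wiki/R">Rob Roe</a></td></tr>
  <tr><th scope="row"><a href="/wiki/B">Borduria</a></th>
      <td><a href="/wiki/K">King</a> – John Unlinked</td></tr>
</table>
"""


def first_mention(cell):
    """Return (title, name) for the first office holder listed in a cell.

    Collects the text of everything before the first <br>, then splits it
    on the en dash. Works whether or not the name is wrapped in a link.
    """
    parts = []
    for node in cell.children:
        if isinstance(node, Tag) and node.name == "br":
            break  # everything after the first <br> is a later mention
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    title, sep, name = "".join(parts).partition("–")
    return title.strip(), (name.strip() if sep else None)


def parse_rows(html):
    """Return a list of (state, title, name) tuples, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        header, cell = tr.find("th"), tr.find("td")
        if header is None or cell is None:
            continue  # row does not have the expected shape
        title, name = first_mention(cell)
        rows.append((header.get_text(strip=True), title, name))
    return rows


for state, title, name in parse_rows(SAMPLE_HTML):
    print(state, title, name, sep=" | ")
```

If the real page uses a different separator or nests the entries in other tags, the split in `first_mention` would need adjusting; the point is only that reading the text up to the first line break avoids indexing into the list of links.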
If I understand your question correctly, the following should help you solve the problem:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"

res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
# Skip the header row, then read the first two cells of each row.
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th', 'td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
        name = data[1].a.find_next_sibling().text
    except (IndexError, AttributeError):
        # The row lacks the expected cells or links; skip it instead of
        # printing stale values left over from the previous row.
        continue
    print("{}|{}|{}".format(country, title, name))
Output:
Afghanistan|President|Ashraf Ghani
Albania|President|Ilir Meta
Algeria|President|Abdelaziz Bouteflika
Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
Angola|President|João Lourenço
Antigua and Barbuda|Queen|Elizabeth II
Argentina|President|Mauricio Macri
And so on.
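Since the question asks for a pandas DataFrame with State | Title | Name columns, the printed values could be collected into a list and converted at the end. This is a minimal sketch that assumes the rows are already available as (country, title, name) tuples; `rows_to_frame` is a hypothetical helper, and the two sample rows are copied from the output above.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build the State | Title | Name frame from (country, title, name) tuples."""
    return pd.DataFrame(rows, columns=["State", "Title", "Name"])


# Sample rows taken from the printed output; in the scraper these would be
# appended inside the loop instead of being printed.
rows = [
    ("Afghanistan", "President", "Ashraf Ghani"),
    ("Albania", "President", "Ilir Meta"),
]

df = rows_to_frame(rows)
print(df.to_string(index=False))
```

Collecting tuples and building the frame once is usually simpler than appending to a DataFrame row by row inside the loop.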
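I found a super simple and quick way to do this: import the wikipedia Python module, then use pandas' read_html to load the table into a DataFrame.
From there you can apply any amount of analysis.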
import pandas as pd
import wikipedia as wp

html = wp.page("List_of_video_games_considered_the_best").html().encode("UTF-8")
try:
    df = pd.read_html(html)[1]  # Try 2nd table first as most pages contain contents table first
except IndexError:
    df = pd.read_html(html)[0]
print(df.to_string())
Or, if you want to call it from the command line, just invoke it with python yourfile.py -p Wikipedia_Page_Article_Here:
import pandas as pd
import argparse
import wikipedia as wp

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--wiki_page", help="Give a wiki page to get table", required=True)
args = parser.parse_args()

html = wp.page(args.wiki_page).html().encode("UTF-8")
try:
    df = pd.read_html(html)[1]  # Try 2nd table first as most pages contain contents table first
except IndexError:
    df = pd.read_html(html)[0]
print(df.to_string())
Hope this helps someone out there!
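Guessing the table by position ([1], then [0]) may pick the wrong one on pages with several tables. One possible alternative is the `match` argument of `pandas.read_html`, which keeps only tables whose text matches a string or regex. The sketch below assumes an HTML parser backend such as lxml is installed; `pick_table` is a hypothetical helper and the HTML is a made-up sample. The string is wrapped in `StringIO` because recent pandas versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Made-up page with a contents-style table followed by the table we want.
SAMPLE_HTML = """
<table><tr><th>Contents</th></tr><tr><td>1 History</td></tr></table>
<table class="wikitable">
  <tr><th>State</th><th>Head of state</th></tr>
  <tr><td>Albania</td><td>President Ilir Meta</td></tr>
</table>
"""


def pick_table(html, match):
    """Return the first table whose text matches `match`.

    pandas raises ValueError if no table on the page matches.
    """
    tables = pd.read_html(StringIO(html), match=match)
    return tables[0]


df = pick_table(SAMPLE_HTML, "Head of state")
print(df.to_string(index=False))
```

Matching on a column header that only the wanted table contains tends to be more robust than an index, which shifts whenever the page gains or loses a table.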