Scraping a table from Wikipedia with Python
I am trying to scrape a Wikipedia table using Python and Beautiful Soup. When I try to get the table's column headers with a for loop, I get this error:
NameError Traceback (most recent call last)
<ipython-input-18-948408e65d8d> in <module>
1 # Header attributes of the table
2 header=[th.text.rstrip()
----> 3 for th in rows[0].find_all('th')]
4 print(header)
5 print('------------')
NameError: name 'rows' is not defined
How can I fix this?
Code:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_municipalities_of_Norway"
# Send the request. If it succeeds, the expected HTTP response status code is 200.
s = requests.Session()
response = s.get(url, timeout=10)
response
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
# Title of the Wikipedia page
soup.title.string
# Get the right table to scrape
right_table = soup.find('table', {"class": 'sortable wikitable'})
# Header attributes of the table
header = [th.text.rstrip()
          for th in rows[0].find_all('th')]
print(header)
print('------------')
print(len(header))
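The traceback points at `rows`, which the code above never assigns: `right_table` is found, but its `<tr>` elements are not collected before the list comprehension runs. A minimal sketch of one way to define it is shown below. The inline `html` string is a hypothetical stand-in for the downloaded page so the example runs offline, and the `data` list is an illustrative extra that is not in the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page (assumption, not the real markup)
html = """
<table class="sortable wikitable">
  <tr><th>Number</th><th>Name</th><th>County</th></tr>
  <tr><td>301</td><td>Oslo</td><td>Oslo</td></tr>
  <tr><td>1101</td><td>Eigersund</td><td>Rogaland</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
right_table = soup.find("table", {"class": "sortable wikitable"})
if right_table is None:
    # find() returns None when no table carries that exact class string
    raise ValueError("table not found")

# This is the missing step: collect the table rows before indexing them
rows = right_table.find_all("tr")

# Header cells live in the first row
header = [th.text.rstrip() for th in rows[0].find_all("th")]

# Remaining rows hold the data cells (illustrative addition)
data = [[td.text.strip() for td in row.find_all("td")] for row in rows[1:]]

print(header)
print(len(header))
```

Note that passing a string containing a space as the class matches the exact value of the `class` attribute, so a table marked up as `class="wikitable sortable"` would not be found by this call; the `None` check makes that case fail with a clear message.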
You can use pandas, which makes this very simple in your case:
import pandas as pd
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_municipalities_of_Norway")
right_table = tables[1]
Output:
| | Number[1](ISO 3166-2:NO) | Name | Adm. center | County | Population(2017)[2] | Area(km²)[3] | CountyMap | Arms | Language form[4] | Mayor[5] | Party |
|----:|---------------------------:|:-----------------------------|:---------------------|:---------------------|----------------------:|---------------:|------------:|-------:|:-------------------------|:----------------------------|:--------|
| 0 | 301 | Oslo | Oslo | Oslo | 673469 | 454.03 | nan | nan | Neutral | Marianne Borgen | SV |
| 1 | 1101 | Eigersund | Egersund | Rogaland | 14898 | 431.66 | nan | nan | Bokmål | Leif Erik Egaas | H |
| 2 | 1103 | Stavanger | Stavanger | Rogaland | 141186 | 262.52 | nan | nan | Bokmål | Kari Nessa Nordtun | Ap |
| 3 | 1106 | Haugesund | Haugesund | Rogaland |
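The column names in the output above still carry Wikipedia footnote markers such as `[1]` and `[2]`. If those get in the way, they can be stripped after loading. The following is only a sketch: the small DataFrame is a hypothetical sample that copies a few columns and the first two rows from the output, and `strip_footnotes` is an illustrative helper name, not part of pandas.

```python
import re

import pandas as pd

# Hypothetical sample mirroring a few columns of the output above
right_table = pd.DataFrame(
    {
        "Number[1](ISO 3166-2:NO)": [301, 1101],
        "Name": ["Oslo", "Eigersund"],
        "Population(2017)[2]": [673469, 14898],
        "Area(km²)[3]": [454.03, 431.66],
    }
)


def strip_footnotes(name):
    """Remove footnote markers such as "[1]" from a column name."""
    return re.sub(r"\[\d+\]", "", str(name)).strip()


# rename() applies the function to every column label and returns a new frame
cleaned = right_table.rename(columns=strip_footnotes)
print(list(cleaned.columns))
```

With real data the same `rename` call would be applied to the table returned by `pd.read_html`.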