Python BeautifulSoup Wikipedia web scraping - learning
I am learning Python and BeautifulSoup, and I am trying to do some web scraping.
Let me first describe what I want to do.
Wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
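I want to print the text: By market capitalization
followed by the body of the banks table, for example:
By market capitalization
Rank | Bank | Cap Rate |
---|---|---|
1 | JP Morgan | 466.1 |
2 | Bank of China | 300 |
and so on, up to 50.
My code starts like this: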
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
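I believe my problem is more on the HTML side, but I am completely lost.
I inspected the page, and the element and tag I think I should look for is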
{section class_='mf-section-2 collapsible-block open-block'}
You are close to your goal: find the heading and the table that follows it, then convert that table to a DataFrame with pandas.read_html().
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output
By market capitalization
Rank | Bank name | Market cap(US$ billion) |
---|---|---|
1 | JPMorgan Chase | 466.21[5] |
2 | Industrial and Commercial Bank of China | 295.65 |
3 | Bank of America | 279.73 |
4 | Wells Fargo | 214.34 |
5 | China Construction Bank | 207.98 |
6 | Agricultural Bank of China | 181.49 |
7 | HSBC Holdings PLC | 169.47 |
8 | Citigroup Inc. | 163.58 |
9 | Bank of China | 151.15 |
10 | China Merchants Bank | 133.37 |
11 | Royal Bank of Canada | 113.80 |
12 | Toronto-Dominion Bank | 106.61 |
...
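To try the heading-then-next-table lookup without hitting the network, here is a minimal sketch that uses only BeautifulSoup on a small inline snippet. The SAMPLE_HTML string and the heading_and_rows helper are hypothetical and exist only for illustration; the snippet mirrors the span markup quoted in the question, and the live page's markup may differ, in which case the selector would need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for the page markup (not the real page).
# "Example Bank A" and its value are made up; the other rows reuse the output above.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="By_total_assets">By total assets</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>1.00</td></tr>
</table>
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<p>Some introductory text.</p>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
  <tr><td>2</td><td>Industrial and Commercial Bank of China</td><td>295.65</td></tr>
</table>
"""


def heading_and_rows(html, heading_id):
    """Return the heading text and the rows of the first table after it."""
    soup = BeautifulSoup(html, "html.parser")
    # Select the <h2> that has a direct child with the given id.
    header = soup.select_one(f"h2:has(> #{heading_id})")
    if header is None:
        raise ValueError(f"no <h2> with a child #{heading_id} was found")
    # find_next walks forward in document order, past the <p> in between.
    table = header.find_next("table")
    if table is None:
        raise ValueError("no table follows the heading")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return header.get_text(strip=True), rows


title, rows = heading_and_rows(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print(" | ".join(row))
```

Raising an error when the selector matches nothing makes a markup mismatch obvious, instead of failing later with an AttributeError on None.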
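Because you already know the heading you want, you can simply print it directly. With pandas, you can then use a search term that is unique to the target table as a more direct way to select it: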
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
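The effect of match can be checked offline with a small sketch like the one below. SAMPLE_HTML is a hypothetical two-table snippet, not the real page, and "Example Bank A" is made up. The string is wrapped in StringIO because recent pandas versions expect a file-like object rather than a literal HTML string; the sketch also assumes lxml is installed, as in the code above.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page: two tables, only one mentions "Market cap".
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>1.00</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap(US$ billion)</th></tr>
  <tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
  <tr><td>2</td><td>Industrial and Commercial Bank of China</td><td>295.65</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
df = tables[0]

print("By market capitalization")
print()
print(df.to_string(index=False))
```

If the search term appears in more than one table, read_html returns all of them, so it is worth checking len(tables) before indexing with [0].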