Python BeautifulSoup Wikipedia web scraping - learning

I am learning Python and BeautifulSoup.

I am trying to do some web scraping.

Let me first describe what I want to do.

Wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks

I am trying to print

<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>

I want to print the text: By market capitalization

Then the body of the banks table. Example: By market capitalization

Rank Bank Cap Rate
1 JP Morgan 466.1
2 Bank of China 300

all the way to 50

My code starts like this:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text 
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup) 

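I believe my problem is more on the HTML side, but I am completely lost. I inspected the page, and the element and tag I think I need to look for is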

{section class_='mf-section-2 collapsible-block open-block'}

You are close to your goal: find the heading and the next table after it, then convert that table to a DataFrame via pandas.read_html().

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]

or

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

header = soup.select_one('h2:has(>#By_market_capitalization)')

print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output

By market capitalization

| Rank | Bank name                               | Market cap(US$ billion) |
|-----:|:----------------------------------------|:------------------------|
|    1 | JPMorgan Chase                          | 466.21[5]               |
|    2 | Industrial and Commercial Bank of China | 295.65                  |
|    3 | Bank of America                         | 279.73                  |
|    4 | Wells Fargo                             | 214.34                  |
|    5 | China Construction Bank                 | 207.98                  |
|    6 | Agricultural Bank of China              | 181.49                  |
|    7 | HSBC Holdings PLC                       | 169.47                  |
|    8 | Citigroup Inc.                          | 163.58                  |
|    9 | Bank of China                           | 151.15                  |
|   10 | China Merchants Bank                    | 133.37                  |
|   11 | Royal Bank of Canada                    | 113.80                  |
|   12 | Toronto-Dominion Bank                   | 106.61                  |

...
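
To see the "find the heading, then take the next table" idea in isolation, here is a minimal sketch that runs on a small inline HTML string instead of the live page. The sample markup, the bank names, and the helper name table_after_heading are made up for illustration. It assumes the heading is an h2 whose direct child carries the id, as in the selector above, and it uses Python's built-in html.parser so neither network access nor lxml is needed. The real Wikipedia markup may differ from this sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (not the real Wikipedia HTML).
SAMPLE_HTML = """
<h2><span class="mw-headline" id="By_total_assets">By total assets</span></h2>
<table>
<tr><th>Rank</th><th>Bank name</th></tr>
<tr><td>1</td><td>Example Bank A</td></tr>
</table>
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<p>Some text between the heading and its table.</p>
<table>
<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td>Example Bank B</td><td>100.0</td></tr>
<tr><td>2</td><td>Example Bank C</td><td>50.5</td></tr>
</table>
"""


def table_after_heading(html, heading_id):
    """Return (heading text, table rows) for the first table after the given heading."""
    soup = BeautifulSoup(html, "html.parser")

    # Select the <h2> that has a direct child with the wanted id.
    header = soup.select_one(f"h2:has(> #{heading_id})")
    if header is None:
        raise ValueError(f"no <h2> with a child #{heading_id}")

    # find_next walks forward in document order, so the <p> in between is skipped.
    table = header.find_next("table")
    if table is None:
        raise ValueError("no table found after the heading")

    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return header.get_text(strip=True), rows


title, rows = table_after_heading(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print(row)
```

The None checks are there because select_one and find_next both return None when nothing matches, which is the usual cause of an AttributeError when the page markup is not what you expected.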

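Since you already know the heading you want, you can simply print it directly. Then, with pandas, you can use a search term that is unique to the target table as a more direct way of selecting it: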

import pandas as pd

df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0,  drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
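
As a rough illustration of how match narrows the result, the sketch below runs read_html on a small inline document with two tables and keeps only the one whose text matches 'Market cap'. The sample HTML and bank names are invented for the example, and it assumes lxml (or bs4 with html5lib) is installed, since read_html needs one of those parsers. The string is wrapped in io.StringIO because recent pandas versions warn when a raw HTML string is passed in directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with two tables; only the second mentions "Market cap".
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>5000</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>100.0</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>50.5</td></tr>
</table>
"""

# match is treated as a regular expression; only tables containing
# matching text are returned, so the "Total assets" table is dropped.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
df = tables[0]

print(len(tables))
print(df)
```

If the search term appears in more than one table, read_html returns all of them, so it is worth checking len(tables) before relying on index 0.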