Python BeautifulSoup Wikipedia web scraping - learning

Question

I am learning Python and BeautifulSoup.

I am trying to do some web scraping.

Let me first describe what I want to do.

Wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks

I am trying to print

<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>

I want to print the text: By market capitalization

followed by the body of the banks table. For example, for "By market capitalization":

Rank	Bank	Cap Rate
1	JP Morgan	466.1
2	Bank of China	300

and so on, up to rank 50.

My code starts like this:

from bs4 import BeautifulSoup
import requests 
            
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text 
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)

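I believe my problem is more on the HTML side, but I am completely lost. From inspecting the page, the element and tag I think I need to look for is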

{section class_='mf-section-2 collapsible-block open-block'}

Answer 1

You are close to your goal. Find the heading and the table that follows it, then convert that table to a DataFrame with pandas.read_html().

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]

or

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]

Example

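from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd  # to_markdown() also needs the optional 'tabulate' package
import requests

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

# Select the <h2> whose direct child carries the section id
header = soup.select_one('h2:has(>#By_market_capitalization)')

# Print the heading text, then the first table that follows the heading
print(header.span.text)
table_html = str(header.find_next('table'))
# Wrapping the markup in StringIO avoids the literal-HTML deprecation warning in newer pandas
print(pd.read_html(StringIO(table_html))[0].to_markdown(index=False))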

Output

By market capitalization

Rank	Bank name	Market cap(US$ billion)
1	JPMorgan Chase	466.21[5]
2	Industrial and Commercial Bank of China	295.65
3	Bank of America	279.73
4	Wells Fargo	214.34
5	China Construction Bank	207.98
6	Agricultural Bank of China	181.49
7	HSBC Holdings PLC	169.47
8	Citigroup Inc.	163.58
9	Bank of China	151.15
10	China Merchants Bank	133.37
11	Royal Bank of Canada	113.80
12	Toronto-Dominion Bank	106.61

...
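
If it helps to see the heading-then-next-table idea without a network request, here is a minimal sketch that runs against a small inline HTML string. The markup in SAMPLE_HTML, the bank names, and the helper heading_and_rows are all made up for illustration; the live Wikipedia markup may differ or change over time, so treat this as a pattern rather than a drop-in solution.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page markup (not the real page).
SAMPLE_HTML = """
<h2><span class="mw-headline" id="By_total_assets">By total assets</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th></tr>
  <tr><td>1</td><td>Example Bank A</td></tr>
</table>
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>10.5</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>9.0</td></tr>
</table>
"""


def heading_and_rows(html, heading_id):
    """Return the heading text and the rows of the first table after it."""
    soup = BeautifulSoup(html, "html.parser")
    # <h2> that has a direct child with the given id
    header = soup.select_one(f"h2:has(> #{heading_id})")
    if header is None:
        raise ValueError(f"no <h2> with a child #{heading_id}")
    # First <table> that appears after the heading in document order
    table = header.find_next("table")
    if table is None:
        raise ValueError("no table after the heading")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return header.get_text(strip=True), rows


title, rows = heading_and_rows(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print("\t".join(row))
```

This version collects the cells by hand instead of calling pandas.read_html(), which keeps the dependencies down to bs4 and makes it easy to see which table find_next() picked.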

Answer 2

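Since you already know the heading you want, you can simply print it. Then, with pandas, you can use a search term that is unique to the target table as a more direct way of selecting it: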

import pandas as pd

df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0,  drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
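
To see what match does in isolation, here is a small offline sketch. The two tables in SAMPLE_HTML and the bank names are invented for the example, and it assumes pandas has an HTML parser such as lxml available. Only tables whose text matches the string (or regex) passed as match are returned.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup with two tables; only the second mentions "Market cap".
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>5000</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
  <tr><td>1</td><td>Example Bank B</td><td>10.5</td></tr>
  <tr><td>2</td><td>Example Bank C</td><td>9.0</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching table
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
df = tables[0]

print(len(tables))
print(df.to_string(index=False))
```

If the search term also appears in another table on the page, more than one DataFrame comes back, so it is worth checking len(tables) before taking [0].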
