Python - Beautiful Soup - How do I extract a single piece of text out of a tag?
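First of all, I want you to know that I'm a complete beginner when it comes to Python and web scraping.
I'm trying to use BeautifulSoup to build a scraper for coinmarketcap.com.
The DOM tree for the coin name looks like this: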
<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">Polygon<small class="nameSymbol">MATIC</small></h2>
My code to extract the name looks like this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_name(url):
    start_url = "https://coinmarketcap.com/all/views/all/"
    url = urljoin(start_url, url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    name = soup.find('h2', class_="sc-1q9q90x-0 jCInrl h1").text[0]
    print(name)

url = "https://coinmarketcap.com/all/views/all/"
website = requests.get(url)
results = BeautifulSoup(website.text, "html.parser")
counter = 0
table = results.find('tbody')
for row in table.find_all('tr'):
    found_coins = []
    if counter == 10:
        break
    else:
        try:
            url = row.find("a", class_="cmc-link").attrs["href"]
            name = get_name(url)
        except AttributeError:
            continue
(Edited: all of the code is now shown.)
The output of the function looks like this:
BitcoinBTC
EthereumETH
Binance CoinBNB
TetherUSDT
SolanaSOL
CardanoADA
XRPXRP
PolkadotDOT
USD CoinUSDC
DogecoinDOGE
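As you can see, the text of the h2 tag is combined with the text of the small tag.
How can I extract only the first piece of text in the h2 tag?
Thanks in advance for your help!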
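Currently you are getting the whole h2 element together with its children. Once you have the h2 element, use find again to get the small element inside it and output its text.
For example: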
h2 = soup.find('h2', class_="sc-1q9q90x-0 jCInrl h1")
name = h2.find('small').text
print(name)
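To make the difference concrete, here is a minimal sketch run against the sample markup from the question rather than the live site. It assumes the built-in "html.parser" and only shows what .text returns on the h2 versus on the nested small element:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, not fetched from the site.
html = '<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">Polygon<small class="nameSymbol">MATIC</small></h2>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

full_text = h2.text               # text of the h2 and all of its descendants, concatenated
symbol = h2.find("small").text    # text of the nested <small> element only

print(full_text)
print(symbol)
```

Note that the small element holds the ticker symbol, so this approach gives you the symbol rather than the coin name.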
Since you only want the text of the h2 element itself, without any of its child elements, try the following:
h2 = soup.find('h2', class_="sc-1q9q90x-0 jCInrl h1")
name = h2.contents[0]
print(name)
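One caveat: contents[0] only works when the text happens to be the first child of the h2. As a rough sketch, assuming you want every piece of text that sits directly inside the tag regardless of position, you could filter the children by type. The helper name own_text is made up for this example and is not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def own_text(tag):
    """Return the text sitting directly inside `tag`, skipping child elements.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    parts = [str(child) for child in tag.children if isinstance(child, NavigableString)]
    return "".join(parts).strip()

# Sample markup copied from the question.
html = '<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">Polygon<small class="nameSymbol">MATIC</small></h2>'
h2 = BeautifulSoup(html, "html.parser").find("h2")
name = own_text(h2)
print(name)

# Also works when the child element comes first.
reordered = BeautifulSoup("<h2><small>MATIC</small> Polygon</h2>", "html.parser").find("h2")
print(own_text(reordered))
```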
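You can do it like this.
- Select the <h2> tag and use .stripped_strings to get the strings inside it (wrap it in list() to get a list).
- Now you have a list with two values, and you can pick whichever string you need.
Here is the complete code.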
from bs4 import BeautifulSoup
s = """<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">Polygon<small class="nameSymbol">MATIC</small></h2>"""
soup = BeautifulSoup(s, 'html.parser')
h = soup.find('h2')
print(list(h.stripped_strings))
['Polygon', 'MATIC']
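If you want the two values as separate variables, a small wrapper along these lines might help. This is only a sketch: split_name_symbol is a made-up name, and it assumes the first string is the coin name and the second, if present, is the symbol, as in the sample markup:

```python
from bs4 import BeautifulSoup

def split_name_symbol(h2):
    """Return (name, symbol) from an <h2> tag; either may be None if missing.

    Hypothetical helper; assumes the name comes first and the symbol second.
    """
    strings = list(h2.stripped_strings)
    name = strings[0] if strings else None
    symbol = strings[1] if len(strings) > 1 else None
    return name, symbol

# Sample markup copied from the question.
s = '<h2 class="sc-1q9q90x-0 jCInrl h1" color="text">Polygon<small class="nameSymbol">MATIC</small></h2>'
soup = BeautifulSoup(s, "html.parser")
name, symbol = split_name_symbol(soup.find("h2"))
print(name, symbol)
```

Checking the list length first avoids an IndexError on pages where the small element is missing.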
You really should use the free CoinMarketCap API. Create an account, generate a key, and then:
import requests

url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest"
headers = {
    "Accepts": "application/json",
    "X-CMC_PRO_API_KEY": "YOUR_KEY_HERE",
}
result = requests.get(url, headers=headers).json()
for coin in result["data"]:
    name = coin["name"]
    symbol = coin["symbol"]
    price = coin["quote"]["USD"]["price"]
    print(f"{name}: 1 {symbol} = {price:0.2f} USD")
The result is:
Bitcoin: 1 BTC = 60420.34 USD
Ethereum: 1 ETH = 4234.89 USD
Binance Coin: 1 BNB = 581.39 USD
Tether: 1 USDT = 1.00 USD
Solana: 1 SOL = 218.57 USD
Cardano: 1 ADA = 1.88 USD
...
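If you want to try the formatting step without an API key, here is a rough offline sketch. The payload below is hand-written and trimmed to just the fields the loop above reads ("data", "name", "symbol", "quote" / "USD" / "price"); the real response likely contains more. format_listing is a made-up helper name, and the prices are the ones from the output above:

```python
def format_listing(coin, currency="USD"):
    """Format one entry of result["data"] as a one-line price summary.

    Hypothetical helper; assumes the entry has the fields used in the loop above.
    """
    price = coin["quote"][currency]["price"]
    return f'{coin["name"]}: 1 {coin["symbol"]} = {price:0.2f} {currency}'

# Hand-written stand-in for the JSON response, for illustration only.
sample = {
    "data": [
        {"name": "Bitcoin", "symbol": "BTC", "quote": {"USD": {"price": 60420.34486452755}}},
        {"name": "Tether", "symbol": "USDT", "quote": {"USD": {"price": 1.0001178308074172}}},
    ]
}

lines = [format_listing(coin) for coin in sample["data"]]
for line in lines:
    print(line)
```

Keeping the formatting in its own function means you can test it on a saved response and only spend API calls when you need fresh prices.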