How to modify the get_text function of BeautifulSoup to produce the required formatting?
I want to scrape this webpage. I am using BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url="https://www.blockchain.com/btc/block/00000000000000000011898368c395f1c35d56ea9109d439256d935a4fe7d656"
page=requests.get(url)
soup=BeautifulSoup(page.text,'html.parser')
# Locate the container that holds the block details
block_details=soup.find(class_="hnfgic-0 jlMXIC")
print(block_details.get_text())
The output is:
Hash00000000000000000011898368c395f1c35d56ea9109d439256d935a4fe7d656Confirmations8Timestamp2019-11-21 17:52Height604806MinerSlushPoolNumber of Transactions2,003Difficulty12,973,235,968,799.78Merkle root49ee8cb431ef3e613fdc9ac3146335d1a608a0e6afb5cf9ab44c9ddc51acfbe9Version0x20000000Bits387,297,854Weight3,993,364 WUSize1,355,728 bytesNonce849,455,972Transaction Volume4560.73542334 BTCBlock Reward12.50000000 BTCFee Reward0.19346486 BTC
But I want the output to be:
Hash
00000000000000000011898368c395f1c35d56ea9109d439256d935a4fe7d656
Confirmations
8
Timestamp
2019-11-21 17:52
Height
604806
.
.
.
I plan to split this string with str.split, so a newline delimiter between the pieces of text would let me separate them with split("\n").
Please help.
Edit: Selenium's .text property produces the output I want, but I would like to solve this with BeautifulSoup.
You can pass the separator='\n' parameter to the get_text() method:
import requests
from bs4 import BeautifulSoup
url="https://www.blockchain.com/btc/block/00000000000000000011898368c395f1c35d56ea9109d439256d935a4fe7d656"
page=requests.get(url)
soup=BeautifulSoup(page.text,'html.parser')
block_details=soup.find(class_="hnfgic-0 jlMXIC")
print(block_details.get_text(separator='\n')) # <-- note the separator parameter
Prints:
Hash
00000000000000000011898368c395f1c35d56ea9109d439256d935a4fe7d656
Confirmations
13
Timestamp
2019-11-21 17:52
Height
604806
Miner
SlushPool
Number of Transactions
2,003
Difficulty
12,973,235,968,799.78
Merkle root
49ee8cb431ef3e613fdc9ac3146335d1a608a0e6afb5cf9ab44c9ddc51acfbe9
Version
0x20000000
Bits
387,297,854
Weight
3,993,364 WU
Size
1,355,728 bytes
Nonce
849,455,972
Transaction Volume
4560.73542334 BTC
Block Reward
12.50000000 BTC
Fee Reward
0.19346486 BTC
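Since the goal is to split the text afterwards, here is a minimal offline sketch of that follow-up step. The HTML snippet and the class name "details" are hypothetical stand-ins for the real page, whose markup may differ. It also passes strip=True, which trims each text fragment and skips whitespace-only ones, so indentation in the markup does not turn into blank lines.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the block-details container;
# the real page's structure and class names may differ.
html = """
<div class="details">
  <div><span>Hash</span><span>0000abcd</span></div>
  <div><span>Confirmations</span><span>8</span></div>
  <div><span>Height</span><span>604806</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
details = soup.find(class_="details")

# Join the text fragments with newlines; strip=True drops the
# whitespace-only strings that sit between the tags.
text = details.get_text(separator="\n", strip=True)
lines = text.split("\n")

# Pair alternating label/value lines into a dict. This assumes every
# label is followed by exactly one value fragment.
fields = dict(zip(lines[0::2], lines[1::2]))
print(fields)
```

The pairing step only works if labels and values strictly alternate. If a value on the real page is spread over several tags, it would yield more than one line and shift the pairs, so it is worth checking the line count before relying on the dict.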