Having trouble accessing <tr> tags with BeautifulSoup
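I am new to Python, but I am trying to build a web scraper with BeautifulSoup. I have a spreadsheet containing a list of names, which I use to generate URLs that take me to a website with tabular data. I then try to grab some of that data and populate a spreadsheet with it. Using the developer tools in Chrome, I can see that the information I want is under <tr> tags. With soup.select('tr') I am trying to generate a list of these tags, which I can then loop through to get the information I want.
However, this call produces an empty list every time. When I navigate to the URL my code generates, I land on the correct page of the website, where I can find the tags and information I am interested in. But when I print(soup.prettify()), I get a heavily stripped-down version of the HTML, without the tags or information I am interested in.
Below I have posted the relevant part of my code, a snippet of the HTML I am trying to get, and the stripped-down version I get instead. Sorry for the long post, but I would appreciate any help.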
import bs4
import requests

# sheet, other_sheet, r (the workbook) and list_length are set up earlier in the script.
base_url = 'http://portal.vertnet.org/search?q=specificepithet:'
for x in range(1, list_length):
    genus = sheet.cell(row=x, column=2).value
    epithet = sheet.cell(row=x, column=3).value
    url = base_url + str(epithet) + '+genus:' + str(genus) + '+hastissue:1'
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    table_rows = soup.select('tr')
    print(len(table_rows))  # this is 0 every time
    tot_entries = min(5, len(table_rows))
    prev_museums = []
    y = 2  # next column to fill in other_sheet
    for table_row in table_rows:
        if len(prev_museums) >= tot_entries:
            break
        cells = table_row.select('td')
        if not cells:
            continue  # skip rows without <td> cells
        museum = cells[0].getText().strip()
        # The first five characters are used as the museum key.
        if museum[0:5] not in prev_museums:
            other_sheet.cell(row=x, column=y).value = museum
            prev_museums.append(museum[0:5])
            y += 1
r.save('completetissuelist.xlsx')
I am trying to capture the first td tag in each of several tr tags.
<tr>
<!--
<td>CUMV Mammal specimens 21200</td>
-->
<td> CUMV Mammal specimens 21200</td>
<td>Mammalia: Sciurus carolinensis</td>
<td> United States, New York, Tompkins County: Ithaca, 505 Hector Street</td>
<td>Collector(s): Margaret Terrell; Preparator(s): Michi T. Schulenberg</td>
<td>female</td>
<!--<td> 2006</td>-->
<td>2006-03-29</td>
<td style="text-align:center">
<span class="glyphicon glyphicon-map-marker"></span>
</td>
<td style="text-align:center"></td> </tr>
Finally, here is the body that BeautifulSoup appears to be parsing, minus the disclaimers.
<body>
<div id="holder">
<div id="main-spinner">
</div>
<div id="header">
<!--
DISCLAIMER
-->
</div>
<div id="content">
</div>
<div id="footer">
<!--
DISCLAIMER
-->
<footer class="footer">
<div class="container">
<p>
VertNet | Funding by
<a href="http://nsf.gov" target="_blank">
<img src="https://www.nsf.gov/images/logos/nsf2.gif" width="30px"/>
</a>
</p>
</div>
</footer>
</div>
</div>
<script data-main="/js/main.js" src="/js/lib/require.js">
</script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-41203333-1', 'vertnet.org');
ga('send', 'pageview');
</script>
<script>
var $buoop = {c:2};
function $buo_f(){
var e = document.createElement("script");
e.src = "//browser-update.org/update.min.js";
document.body.appendChild(e);
};
try {document.addEventListener("DOMContentLoaded", $buo_f,false)}
catch(e){window.attachEvent("onload", $buo_f)}
</script>
</body>
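Sorry again for the length, but I would appreciate any help.
The search results are loaded by an XHR POST request to the http://portal.vertnet.org/service/rpc/record.search endpoint, which is why they are missing from the HTML that requests.get() returns. Simulate this request in your code and parse the JSON response (no HTML parsing involved):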
import json
import requests

specific_epithet = "cedrorum"
genus = "Bombycilla"
url = 'http://portal.vertnet.org/service/rpc/record.search'

# The query itself is sent as a JSON-encoded string under the "q" key.
payload = {
    "limit": 100,
    "q": json.dumps(
        {"keywords": ["specificepithet:" + specific_epithet, "genus:" + genus, "hastissue:1"]}
    )
}
res = requests.post(url,
                    json=payload,
                    headers={'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"})
res.raise_for_status()
data = res.json()

# Each item carries its record as a JSON string under the "json" key.
for item in data["items"]:
    item_data = json.loads(item["json"])
    print(item["id"] + " " + item_data["title"] + " " + item_data["scientificname"])
Prints:
amnh/birds/dot-15423 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15937 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15938 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15939 AMNH Bird Collection Bombycilla cedrorum
...
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179106-seid-1065589 MVZ Bird Collection (Arctos) Bombycilla cedrorum
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179116-seid-928935 MVZ Bird Collection (Arctos) Bombycilla cedrorum
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179307-seid-1242383 MVZ Bird Collection (Arctos) Bombycilla cedrorum
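To tie this back to the loop in the question, here is a minimal sketch of two helpers: one builds the request body for a given genus and epithet, the other collects up to five distinct collection titles from the returned items. The names build_payload, distinct_titles and sample_items are hypothetical, and treating the "title" field as the museum name is an assumption based on the output above. The sample data is made up so the snippet runs without network access.

```python
import json


def build_payload(genus, epithet, limit=100):
    """Build the POST body shown above for one genus/epithet pair."""
    keywords = [
        "specificepithet:" + str(epithet),
        "genus:" + str(genus),
        "hastissue:1",
    ]
    return {"limit": limit, "q": json.dumps({"keywords": keywords})}


def distinct_titles(items, max_titles=5):
    """Return up to max_titles distinct titles, in order of first appearance."""
    titles = []
    for item in items:
        record = json.loads(item["json"])  # each record is a JSON string
        title = record.get("title")
        if title and title not in titles:
            titles.append(title)
        if len(titles) == max_titles:
            break
    return titles


# Hypothetical items shaped like the response above; not real data.
sample_items = [
    {"id": "a/1", "json": json.dumps({"title": "Collection A"})},
    {"id": "a/2", "json": json.dumps({"title": "Collection A"})},
    {"id": "b/1", "json": json.dumps({"title": "Collection B"})},
]
print(distinct_titles(sample_items))  # ['Collection A', 'Collection B']
```

Inside your spreadsheet loop you could then post build_payload(genus, epithet) with requests.post as above, pass res.json()["items"] to distinct_titles, and write each returned title to other_sheet.cell(row=x, column=y).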