Parse Wiley Online Library
I want to extract the DOIs of all chapters of Ullmann's Encyclopedia of Industrial Chemistry using Python and BeautifulSoup.
So from
<h2 class="meta__title meta__title__margin"><span class="hlFld-Title"><a href="/doi/10.1002/14356007.c01_c01.pub2">Aerogels</a></span></h2>
I want to get "Aerogels" and "/doi/full/10.1002/14356007.c01_c01.pub2".
A larger sample:
<ul class="chapter_meta meta__authors rlist--inline comma">
<li><span class="hlFld-ContribAuthor"><a href="/action/doSearch?ContribAuthorStored=H%C3%BCsing%2C+Nicola"><span>Nicola Hüsing</span></a></span></li>
<li><span class="hlFld-ContribAuthor"><a href="/action/doSearch?ContribAuthorStored=Schubert%2C+Ulrich"><span>Ulrich Schubert</span></a></span></li>
</ul><span class="meta__epubDate"><span>First published: </span>15 December 2006</span><div class="content-item-format-links">
<ul class="rlist--inline separator">
<li><a title="Abstract" href="/doi/abs/10.1002/14356007.c01_c01.pub2">Abstract</a></li>
<li><a title="Full text" href="/doi/full/10.1002/14356007.c01_c01.pub2">
Full text
</a></li>
What I tried for the title:
span['hlFld-Title'].a
What I tried for the DOI:
for link in soup.find_all('a'.title):
    print(link.get('href'))
Unfortunately I am a complete newbie (dummy) and it does not work.
The URL is https://onlinelibrary.wiley.com/browse/book/10.1002/14356007/title?startPage={1..59}
Thanks for your help.
Here is a quick solution that prints "DOI;title" pairs to the command line:
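import requests
from bs4 import BeautifulSoup

# Page numbers run 0..58 here; the question lists startPage as 1..59, so adjust if needed.
for i in range(59):
    page = requests.get("https://onlinelibrary.wiley.com/browse/book/10.1002/14356007/title?startPage={}".format(i))
    soup = BeautifulSoup(page.content, 'lxml')
    # Each chapter title sits in <span class="hlFld-Title"><a href="/doi/...">title</a></span>
    content = soup.find_all("span", class_="hlFld-Title")
    for c in content:
        # The href is the "/doi/<DOI>" path, not the "/doi/full/<DOI>" variant
        print(c.a.get('href') + ";" + c.get_text())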
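The question asks for the "/doi/full/..." path, while the heading link only carries "/doi/...". The following is a minimal offline sketch of one way to derive it, run against the sample heading from the question rather than the live site. The function name extract_chapters and the SAMPLE_HTML constant are made up for illustration, and the sketch assumes every heading href has the form "/doi/<DOI>", which is only confirmed for the one sample shown.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (one chapter heading).
SAMPLE_HTML = '''
<h2 class="meta__title meta__title__margin"><span class="hlFld-Title"><a href="/doi/10.1002/14356007.c01_c01.pub2">Aerogels</a></span></h2>
'''


def extract_chapters(html):
    """Return a list of (title, doi, full_text_path) tuples.

    Hypothetical helper: assumes each title link looks like "/doi/<DOI>".
    """
    # "html.parser" ships with Python, so no lxml install is needed here.
    soup = BeautifulSoup(html, "html.parser")
    chapters = []
    for span in soup.find_all("span", class_="hlFld-Title"):
        link = span.find("a", href=True)
        if link is None:
            continue  # title span without a link; skip it
        href = link["href"]
        if not href.startswith("/doi/"):
            continue  # unexpected link shape; skip rather than guess
        doi = href[len("/doi/"):]
        chapters.append((link.get_text(strip=True), doi, "/doi/full/" + doi))
    return chapters


if __name__ == "__main__":
    for title, doi, full_path in extract_chapters(SAMPLE_HTML):
        print(title, doi, full_path, sep=";")
```

To use it with the loop above, pass page.content to extract_chapters instead of building the soup inline. Skipping links that do not start with "/doi/" keeps the output clean if the page layout differs from the sample.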