Trouble extracting list of quotes from webpage using bs4 and Python
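I want to use bs4 to navigate to a webpage and extract all the quotes on the page into a list.
I would also like to extract the total number of pages for that particular person (an element at the bottom of the page).
The code I am currently using is this: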
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
div_container = s.findAll("div", {"id":"quotesList"})
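I am having trouble searching the div_container object for the quotes.
The easiest way is to find them by their title attribute (every quote link has one):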
import requests
from bs4 import BeautifulSoup
url = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
r = requests.get(url)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.text, "html.parser")
# Collect every "a" tag whose title is "view quote"
all_a_quotes = soup.find_all("a", attrs={"title": "view quote"})
for a in all_a_quotes:
    # do something...
    print(a.text)
This will output (60 quotes in total):
I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
You are rich if and only if money you refuse tastes better than money you accept.
If you take risks and face your fate with dignity, there is nothing you can do that makes you small; if you don't take risks, there is nothing you can do that makes you grand, nothing.
Steve Jobs, Bill Gates and Mark Zuckerberg didn't finish college. Too much emphasis is placed on formal education - I told my children not to worry about their grades but to enjoy learning.
[...]
Debt is a mistake between lender and borrower, and both should suffer.
Capitalism is about adventurers who get harmed by their mistakes, not people who harm others with their mistakes.
The next time you experience a blackout, take some solace by looking at the sky. You will not recognize it.
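For pagination, we check whether the last "ul" element (the pager) exists. If it does not, there is only one page; if it does, we count how many "li" elements it has and subtract 2:
pagination = soup.select('ul[class*="pagination"]')
if not pagination:
    # no pager on the page means there is only one page
    pages = 1
else:
    # subtract two: the "next" and "previous" entries
    pages = len(pagination[0].find_all("li")) - 2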
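To try the pagination logic without hitting the site, here is a minimal sketch that runs it against inline HTML. The sample markup and the function name count_pages are made up for illustration; the real pager on the site may be structured differently. It returns 1 when no pager is found, in line with the rule that a missing pager means a single page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a pager with "Prev", three numbered pages, and "Next".
SAMPLE_WITH_PAGER = """
<ul class="pagination bq-pagination">
  <li><a href="#">Prev</a></li>
  <li><a href="#">1</a></li>
  <li><a href="#">2</a></li>
  <li><a href="#">3</a></li>
  <li><a href="#">Next</a></li>
</ul>
"""
# Hypothetical markup for an author whose quotes fit on one page.
SAMPLE_WITHOUT_PAGER = "<div id='quotesList'></div>"


def count_pages(html):
    """Count pages from the pager; a page with no pager counts as one page."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.select('ul[class*="pagination"]')
    if not pagination:
        return 1
    # Subtract the "previous" and "next" entries.
    return len(pagination[0].find_all("li")) - 2


pages_with_pager = count_pages(SAMPLE_WITH_PAGER)
pages_without_pager = count_pages(SAMPLE_WITHOUT_PAGER)
print(pages_with_pager, pages_without_pager)
```

Passing the HTML in as a string keeps the counting logic separate from the download step, so it can be checked offline.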
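This is my first time helping, so I apologize if this isn't the best answer. I'm a Python novice, so I find it helpful to print things and save them to a file to see what the program is looking at.
I do this with the following code:
# This opens a file in 'w' (write) mode. If 'export.txt' doesn't exist, Python creates it!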
file1 = open('export.txt', 'w')
#This writes whatever I want to the file.
file1.write("This is what I want in the file")
#This safely closes the file.
file1.close()
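As a side note, the same save-to-file step can be written with a with-block, which closes the file automatically even if the write fails. This is only a sketch: the helper name save_text and the temporary file path are my own choices, not part of the original code.

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path; the with-block closes the file even if write() raises."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical location: a scratch file in the system temp directory.
export_path = os.path.join(tempfile.gettempdir(), "export_example.txt")
save_text(export_path, "This is what I want in the file")

# Read the file back to confirm what was saved.
with open(export_path, encoding="utf-8") as handle:
    saved = handle.read()
print(saved)
```

Setting encoding="utf-8" explicitly is a precaution for scraped pages, which often contain characters the platform's default encoding cannot write.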
Applying this to your code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
div_container = s.findAll("div", {"id":"quotesList"})
# findAll returns a bs4.element.ResultSet (a list of tags), so file1.write(div_container) fails and .text cannot be applied to it directly.
# Each item in the ResultSet is a tag, which does have .text, so iterate over the items instead.
# In this case there is only one element. That is to say, div_container[1] doesn't exist.
# Open the file once, before the loop: reopening it in 'w' mode inside the loop would overwrite earlier items.
file1 = open('export.txt', 'w')
for item in div_container:
    # .text returns just the text inside the tag, with none of the HTML markup.
    file1.write(item.text)
file1.close()
This gives us the following:
I'm in favour of religion as a tamer of arrogance. For a Greek
Orthodox, the idea of God as creator outside the human is not God in
God's terms. My God isn't the God of George Bush. Nassim Nicholas
Taleb
God Religion Arrogance etc.
So what is going on here?
If we run the code again, but instead of looking at the div we use BS4's prettify method to look at the actual HTML, like so:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
#.prettify() is here
s = soup(webpage,"html.parser").prettify()
file1 = open('export2.txt', 'w')
file1.write(s)
file1.close()
We can open the text document and see what Python sees. A snippet of it is:
<div class="bq_center ql_page">
<div class="reflow_body bq_center">
<div class="new-msnry-grid bqcpx grid-layout-hide" id="quotesList">
<div class="m-brick grid-item boxy clearfix bqQt r-width" id="qpos_1_1">
<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
</a>
<a class="bq-aut qa_530963 oncl_a" href="/quotes/nassim_nicholas_taleb_530963" title="view author">
Nassim Nicholas Taleb
</a>
<div class="qbn-box">
<div class="sh-cont">
<a aria-label="Share this quote on Facebook" class="sh-fb sh-grey" href="/share/fb/530963" rel="nofollow" target="_blank">
<img alt="Share on Facebook" class="bq-fa" src="/st/img/4341377/fa/facebook-f.svg"/>
</a>
<a aria-label="Share this quote on Twitter" class="sh-tw sh-grey" href="/share/tw/530963?ti=Nassim+Nicholas+Taleb+Quotes" rel="nofollow" target="_blank">
<img alt="Share on Twitter" class="bq-fa" src="/st/img/4341377/fa/twitter.svg"/>
</a>
<a aria-label="Share this quote on LinkedIn" class="sh-tw sh-grey" href="/share/li/530963?ti=Nassim+Nicholas+Taleb+Quotes+-+BrainyQuote" rel="nofollow" target="_blank">
<img alt="Share on LinkedIn" class="bq-fa" src="/st/img/4341377/fa/linkedin-in.svg"/>
</a>
</div>
</div>
<div class="kw-box">
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="0" href="/topics/god-quotes">
God
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="1" href="/topics/religion-quotes">
Religion
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="2" href="/topics/arrogance-quotes">
Arrogance
</a>
</div>
</div>
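Why does this matter to you? Because you are pulling the entire div, and once the share buttons (which contain no text) are left out, the content of that div reduces to: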
<div class="m-brick grid-item boxy clearfix bqQt r-width" id="qpos_1_1">
<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
I'm in favour of religion as a tamer of arrogance. For a Greek Orthodox, the idea of God as creator outside the human is not God in God's terms. My God isn't the God of George Bush.
</a>
<a class="bq-aut qa_530963 oncl_a" href="/quotes/nassim_nicholas_taleb_530963" title="view author">
Nassim Nicholas Taleb
</a>
<div class="kw-box">
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="0" href="/topics/god-quotes">
God
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="1" href="/topics/religion-quotes">
Religion
</a>
<a class="qkw-btn btn btn-xs oncl_klc" data-idx="2" href="/topics/arrogance-quotes">
Arrogance
</a>
</div>
</div>
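This is where we need to think about how best to get at the information. We can see that the quote sits inside a div tag and also inside an a tag. However, if we pull on either of those tag names alone, we will end up grabbing the author and keyword links again as well, so we need to find something that is unique to the quotes and to nothing else.
So, if we look back at the second export and compare the a tags around the quotes: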
<a class="b-qt qt_530963 oncl_q" href="/quotes/nassim_nicholas_taleb_530963" title="view quote">
<a class="b-qt qt_531016 oncl_q" href="/quotes/nassim_nicholas_taleb_531016" title="view quote">
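We can see that the class and href values change from quote to quote (the numeric ID differs), so they won't be much help, but the title stays the same, so we can use that. Using your code as a template again: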
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
quotes = s.find_all("a", attrs={"title": "view quote"})
for a in quotes:
    listOfQuotes.append(a.text)
print(listOfQuotes)
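To see the difference between pulling the whole container and filtering on the title, here is a small offline sketch. The HTML is a shortened, made-up sample modeled on the export above (placeholder quotes and author), and the CSS selector form plus get_text(strip=True) are my own choices; they also trim the newlines that a.text would keep.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modeled on the exported HTML.
SAMPLE = """
<div id="quotesList">
  <div id="qpos_1_1">
    <a class="b-qt qt_1 oncl_q" href="/quotes/example_1" title="view quote">
      First quote.
    </a>
    <a class="bq-aut qa_1 oncl_a" href="/quotes/example_1" title="view author">
      Some Author
    </a>
    <div class="kw-box"><a href="/topics/topic-quotes">Topic</a></div>
  </div>
  <div id="qpos_1_2">
    <a class="b-qt qt_2 oncl_q" href="/quotes/example_2" title="view quote">
      Second quote.
    </a>
    <a class="bq-aut qa_2 oncl_a" href="/quotes/example_2" title="view author">
      Some Author
    </a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# The whole container: quotes, author names and keywords all run together.
container = soup.find("div", {"id": "quotesList"})
container_text = container.get_text(" ", strip=True)

# Only the links whose title is exactly "view quote".
quotes = [a.get_text(strip=True) for a in soup.select('a[title="view quote"]')]

print(container_text)
print(quotes)
```

The container text mixes in the author and keyword links, while the filtered list holds only the quote text.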
For the second part of your question, I would use what Lucas said before me, adapted to your code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
listOfQuotes = []
website = "https://www.brainyquote.com/authors/nassim-nicholas-taleb-quotes"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
s = soup(webpage,"html.parser")
# Collect every "a" tag whose title is "view quote"
quotes = s.find_all("a", attrs={"title": "view quote"})
for a in quotes:
    # do something...
    listOfQuotes.append(a.text)
pagination = s.select('ul[class*="pagination"]')
if not pagination:
    # no pager on the page means there is only one page
    pages = 1
else:
    # subtract two: the "next" and "previous" entries
    pages = len(pagination[0].find_all("li")) - 2
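One caveat: subtracting two assumes every "li" other than next and previous is a page number. If the pager ever collapses a range of pages into an ellipsis entry, the count would come out too low. A possible alternative, sketched below on made-up markup, is to read the largest number shown in the pager instead. The function name last_page_number and the sample HTML are hypothetical, and I have not confirmed that the real site uses an ellipsis.

```python
from bs4 import BeautifulSoup

# Hypothetical pager that hides pages 3 to 6 behind an ellipsis entry.
SAMPLE = """
<ul class="pagination">
  <li><a href="#">Prev</a></li>
  <li><a href="#">1</a></li>
  <li><a href="#">2</a></li>
  <li><span>..</span></li>
  <li><a href="#">7</a></li>
  <li><a href="#">Next</a></li>
</ul>
"""


def last_page_number(html):
    """Return the largest page number shown in the pager, or 1 if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = [
        int(li.get_text(strip=True))
        for li in soup.select('ul[class*="pagination"] li')
        if li.get_text(strip=True).isdigit()
    ]
    # default=1 covers a page with no pager or no numeric entries.
    return max(numbers, default=1)


last_page = last_page_number(SAMPLE)
print(last_page)
```

On this sample, counting the "li" entries and subtracting two would give 4, while reading the labels gives 7.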