Trouble with scraping links with BeautifulSoup
Here is my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.boursorama.com/bourse/opcvm/'
results = requests.get(url0, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
n = 2
m = 2
linklist = []
while n <= m:
    # Source path
    url = f"https://www.boursorama.com/bourse/opcvm/page-{n}?sortAsc=1"
    results = requests.get(url0, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    links = soup.find_all('div', class_="o-pack__item u-ellipsis", attrs={"href"})
    for i in links:
        print(i.get('href'))
    n = n + 1
I get this:
None
None
None
None
None
I don't understand why, because when I run this (I changed only one line of code, to print(i)):
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.boursorama.com/bourse/opcvm/'
results = requests.get(url0, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
n = 2
m = 3
linklist = []
while n <= m:
    # Source path
    url = f"https://www.boursorama.com/bourse/opcvm/page-{n}?sortAsc=1"
    results = requests.get(url0, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    links = soup.find_all('div', class_="o-pack__item u-ellipsis", attrs={"href"})
    for i in links:
        print(i)
    n = n + 1
I get this:
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" title="Allianz Global Government Bond W H EUR">Allianz Global Government Bond W H EUR</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001HEV7/" title="Allianz Global Credit SRI WT Hedged SEK">Allianz Global Credit SRI WT Hedged SEK</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001AM5S/" title="Barings Global High Yield Bond C AUD Acc">Barings Global High Yield Bond C AUD Acc</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P00012TCF/" title="GS NA Engy & Engy Infras Eq Base Inc USD">GS NA Engy & Engy Infras Eq Base Inc USD</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001ESKF/" title="DWS Invest Enh Cmdty Strat USD TFC">DWS Invest Enh Cmdty Strat USD TFC</a></div>
We can clearly see the href attributes in the tags. I searched online, but I always see i.get('href') or i['href'] suggested, and I always end up with None. What is going on?
You are selecting the <div> elements, and they have no href attribute; the <a> that carries it is nested inside the <div>:
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" title="Allianz Global Government Bond W H EUR">Allianz Global Government Bond W H EUR</a></div>
How to fix
If you want to print the value of href, you first have to select the nested <a>:
print(i.a.get('href'))
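As a quick self-contained check, the difference is easy to reproduce without any network request, using one of the <div> snippets from the output above as a static sample:

```python
from bs4 import BeautifulSoup

# One <div> from the printed output, used here as a static sample
html = ('<div class="o-pack__item u-ellipsis">'
        '<a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" '
        'title="Allianz Global Government Bond W H EUR">'
        'Allianz Global Government Bond W H EUR</a></div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', class_="o-pack__item u-ellipsis")

print(div.get('href'))    # None - the <div> itself has no href attribute
print(div.a.get('href'))  # /bourse/opcvm/cours/0P0001CBIB/ - the nested <a> does
```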
Another option is to make your selection more specific, for example with CSS selectors:
links = soup.select('div.o-pack__item.u-ellipsis a[href]')
for i in links:
    print(i.get('href'))
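On a static sample built from two of the rows printed above, the selector matches the <a> tags directly, so .get('href') works on each result without going through the parent <div>:

```python
from bs4 import BeautifulSoup

# Two rows from the printed output, used here as a static sample
html = '''
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" title="Allianz Global Government Bond W H EUR">Allianz Global Government Bond W H EUR</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001HEV7/" title="Allianz Global Credit SRI WT Hedged SEK">Allianz Global Credit SRI WT Hedged SEK</a></div>
'''

soup = BeautifulSoup(html, "html.parser")
# The selector matches only <a> tags that actually carry an href
hrefs = [a.get('href') for a in soup.select('div.o-pack__item.u-ellipsis a[href]')]
print(hrefs)
# ['/bourse/opcvm/cours/0P0001CBIB/', '/bourse/opcvm/cours/0P0001HEV7/']
```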
Example
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
url0 = 'https://www.boursorama.com/bourse/opcvm/'
results = requests.get(url0, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
n = 2
m = 2
linklist = []
while n <= m:
    # Source path
    url = f"https://www.boursorama.com/bourse/opcvm/page-{n}?sortAsc=1"
    # Fetch the paged url (note: the original script fetched url0 here on every iteration)
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    links = soup.find_all('div', class_="o-pack__item u-ellipsis")
    for i in links:
        print(i.a.get('href'))
    n = n + 1
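Since the script declares linklist but never fills it, here is a hedged sketch of how the relative paths could be turned into absolute URLs and collected; the HTML string stands in for one fetched page (in the real script it would be results.text):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.boursorama.com'
# Stand-in for one fetched page
html = ('<div class="o-pack__item u-ellipsis">'
        '<a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" '
        'title="Allianz Global Government Bond W H EUR">'
        'Allianz Global Government Bond W H EUR</a></div>')

soup = BeautifulSoup(html, "html.parser")
linklist = []
for a in soup.select('div.o-pack__item.u-ellipsis a[href]'):
    # urljoin resolves the relative href against the site root
    linklist.append(urljoin(base, a.get('href')))

print(linklist)
# ['https://www.boursorama.com/bourse/opcvm/cours/0P0001CBIB/']
```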