Python BeautifulSoup not extracting every URL
I'm trying to find all the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments
More specifically, I want the links hyperlinked under each "Subject Code". However, when I run my code, hardly any links are extracted.
I'd like to know why that happens and how I can fix it.
from bs4 import BeautifulSoup
import requests
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
This is my first attempt at web scraping.
There is anti-bot protection; just add a user-agent to your headers and you're set. And don't forget to check your soup when something goes wrong.
from bs4 import BeautifulSoup
import requests
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
Without the header, the information in the soup was:
Sorry for the inconvenience.
We have detected excess or unusual web requests originating from your browser, and are unable to determine whether these requests are automated.
To proceed to the requested page, please complete the captcha below.
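To make the "check your soup" advice concrete, a small guard (a hypothetical helper, not part of the question's code) can flag the block page before you bother parsing it. The marker phrases are taken from the interstitial quoted above:

```python
def looks_blocked(html: str) -> bool:
    # Heuristic: look for phrases from the anti-bot page quoted above.
    markers = ("complete the captcha", "unusual web requests")
    text = html.lower()
    return any(m in text for m in markers)

# The quoted block-page text trips the check; a normal page does not.
print(looks_blocked("please complete the captcha below."))  # True
print(looks_blocked("<html><body>Course Schedule</body></html>"))  # False
```

Calling this on `r.text` right after the request makes the failure mode obvious instead of silently yielding an empty link list.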
I would use nth-child(1) to restrict matches to the first column of the table matched by id, then simply extract .text. If it contains *, supply a default string for the course not offered; otherwise, concatenate the retrieved course identifier onto the base query-string structure:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')
no_course = ''
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
course_info = {i.text:(no_course if '*' in i.text else base + i.text) for i in soup.select('#mainTable td:nth-child(1)')}
course_info
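If you would rather read the actual href attributes than rebuild the query string, a variant of the same selector works. This is a sketch against a minimal stand-in table (not the live page), assuming the linked subject-code cells contain <a> tags as they do on the real schedule page:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = 'https://courses.students.ubc.ca/cs/courseschedule'

# Minimal stand-in for the schedule table: one linked subject and one
# starred, link-less subject, mirroring the structure the selector targets.
html = '''
<table id="mainTable">
  <tr><td><a href="?pname=subjarea&tname=subj-department&dept=CPSC">CPSC</a></td></tr>
  <tr><td>ZZZZ *</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keep only first-column cells that actually contain a link; resolve the
# relative href against the base URL.
links = {td.get_text(strip=True): urljoin(base, td.a['href'])
         for td in soup.select('#mainTable td:nth-child(1)') if td.a}
print(links)
```

Using `urljoin` means the relative `?pname=...` hrefs resolve correctly, and the `if td.a` filter drops the starred, link-less rows without needing a sentinel string.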