BeautifulSoup: An issue with find_all() and unicode?
I'm using BeautifulSoup to build a web scraper that collects every ad on a Craigslist page. Here's what I have so far:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4
page = "http://miami.craigslist.org/search/roo?query=brickell"
search_html = requests.get(page).text
roomSoup = BeautifulSoup(search_html, "html.parser")
ad_list = roomSoup.find_all("a", {"class":"hdrlnk"})
#print ad_list
ad_ls = [item["href"] for item in ad_list]
#print ad_ls
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
#print ad_urls
url_str = [str(unicode) for unicode in ad_urls]
# What's in url_str?
for url in url_str:
    print url
When I run this, I get:
miami.craigslist.org/mdc/roo/4870912192.html
miami.craigslist.org/mdc/roo/4858122981.html
miami.craigslist.org/mdc/roo/4870665175.html
miami.craigslist.org/mdc/roo/4857247075.html
miami.craigslist.org/mdc/roo/4870540048.html ...
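This is exactly what I want: a list containing the URL of every ad on the page.
My next step is to extract something from each of those pages, which means building another BeautifulSoup object. But I get stopped short:
for url in url_str:
    ad_html = requests.get(str(url)).text
And here we finally get to my question: what exactly is this error? The only part I can make sense of is the last two lines: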
Traceback (most recent call last):
  File "webscraping.py", line 24, in <module>
    ad_html = requests.get(str(url)).text
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 65, in get
    return request('get', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/api.py", line 49, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 447, in request
    prep = self.prepare_request(req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 378, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 303, in prepare
    self.prepare_url(url, params)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 360, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied. Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
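The problem seems to be that all of my links are preceded by u', so requests.get() doesn't work. That's why you see me trying to force every URL into a regular string with str(). No matter what I do, though, I still get this error. Is there something I'm missing? Am I completely misunderstanding my problem?
Thanks in advance!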
It looks like you have misunderstood the problem.
The message:
u'miami.craigslist.org/mdc/roo/4870912192.html': No schema supplied.
Perhaps you meant http://miami.craigslist.org/mdc/roo/4870912192.html?
means that the URL is missing the http:// (the scheme) in front of it.
So replacing
ad_urls = ["miami.craigslist.org" + ad for ad in ad_ls]
with
ad_urls = ["http://miami.craigslist.org" + ad for ad in ad_ls]
should do the job.
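As a side note, the u' prefix in the error message only marks a Python 2 unicode string and is not what requests is complaining about. A slightly more defensive variant of the fix could look like the sketch below: it resolves each href against a base URL that already carries the scheme, and checks for a scheme before any request is made. The names BASE_URL, absolute_ad_urls and has_scheme are hypothetical helpers introduced for illustration, and the hrefs are copied from the output in the question. The imports are written for Python 3; on Python 2.7 the same two functions should be available via "from urlparse import urljoin, urlparse".

```python
from urllib.parse import urljoin, urlparse

# Assumed base URL; note that it includes the "http://" scheme.
BASE_URL = "http://miami.craigslist.org"


def absolute_ad_urls(hrefs, base=BASE_URL):
    """Resolve each (relative) href against the base URL."""
    return [urljoin(base, href) for href in hrefs]


def has_scheme(url):
    """Return True if the URL has an http/https scheme and a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Relative hrefs of the kind find_all() returned in the question.
hrefs = ["/mdc/roo/4870912192.html", "/mdc/roo/4858122981.html"]
ad_urls = absolute_ad_urls(hrefs)

for url in ad_urls:
    print(url, has_scheme(url))
```

With URLs built this way, requests.get(url) should no longer raise MissingSchema, and has_scheme() gives a cheap check to run before fetching each ad page.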