Getting a list of Urls and then finding specific text from all of them in Python 3.5.1
So I have this code, which gives me the urls I need in a list:
import requests
from bs4 import BeautifulSoup

offset = 0
links = []
with requests.Session() as session:
    while True:
        r = session.get("http://rayleighev.deviantart.com/gallery/44021661/Reddit?offset=%d" % offset)
        soup = BeautifulSoup(r.content, "html.parser")
        new_links = soup.find_all("a", {'class': "thumb"})

        # no more links - break the loop
        if not new_links:
            break

        links.extend(new_links)
        print(len(links))
        offset += 24

        # denotes the number of gallery pages gone through at one time (# of pages times 24 equals the number below)
        if offset == 48:
            break

for link in links:
    print(link.get("href"))
After that, I'm trying to get a particular piece of text from each of the urls; it sits in roughly the same place on every page. However, whenever I run the lower half below, I get back a big chunk of html text along with some errors. I'm not sure how to fix it, or whether there is some other, preferably simpler, way to get the text from each url.
import urllib.request
import re

for link in links:
    url = print("%s" % link)
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    paragraphs = re.findall(r'</a><br /><br />(.*?)</div>', str(respData))
    if paragraphs != None:
        paragraphs = re.findall(r'<br /><br />(.*?)</span>', str(respData))
    if paragraphs != None:
        paragraphs = re.findall(r'<br /><br />(.*?)</span></div>', str(respData))
    for eachP in paragraphs:
        print(eachP)
    title = re.findall(r'<title>(.*?)</title>', str(respData))
    for eachT in title:
        print(eachT)
Your code:
for link in links:
    url = print("%s" % link)
assigns None to url, because print() always returns None. Perhaps you meant:
for link in links:
    url = "%s" % link.get("href")
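To see why the original assignment fails, here is a quick demonstration using a made-up thumbnail tag (the example.com href is hypothetical):

```python
from bs4 import BeautifulSoup

# A made-up "thumb" link, just to illustrate the bug.
link = BeautifulSoup('<a class="thumb" href="http://example.com/art/123">x</a>',
                     "html.parser").a

url = print("%s" % link)   # print() returns None, whatever it printed
print(url)                 # None

url = "%s" % link.get("href")   # correct: take the href attribute
print(url)                      # http://example.com/art/123
```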
There is also no reason to use urllib to get the site's content; you can use requests like you did before, by changing:
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
to
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")
Now you can get the title and paragraph with:
title = soup.find('div', {'class': 'dev-title-container'}).h1.text
paragraph = soup.find('div', {'class': 'text block'}).text
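Putting the pieces together, here is a minimal sketch of the extraction step. Since the live DeviantArt pages may change (and fetching them needs a network call via requests.get(url, headers=headers)), the sketch runs against a sample HTML snippet that mimics the dev-title-container and text block markup the answer relies on:

```python
from bs4 import BeautifulSoup

# Sample HTML mimicking the relevant parts of a deviation page;
# on a real page you would do: soup = BeautifulSoup(requests.get(url).content, "html.parser")
sample_html = """
<html><body>
<div class="dev-title-container"><h1>Example Title</h1></div>
<div class="text block">First paragraph of the description.</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() with the exact class string matches these divs
title = soup.find('div', {'class': 'dev-title-container'}).h1.text
paragraph = soup.find('div', {'class': 'text block'}).text

print(title)      # Example Title
print(paragraph)  # First paragraph of the description.
```

In the real loop you would build url from link.get("href") for each link in links, fetch it once, and pull both fields from the same soup instead of running several regexes over the raw bytes.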