Website works with half my webscraping code but other half gives an error message
I'm new to web scraping and I'm having a hard time figuring out a problem: the website I'm scraping works with half of my code but not the other half.

I'm using the scraping code below to pull data from mmadecisions.com. I successfully pull the first page of links, and I successfully open the pages behind those links, but when I reach the third page 'layer' it gives me an error. Is it JavaScript? It's strange, because when I feed an href link directly into the 'get_single_item_data' function it runs perfectly. Does that mean I should use Selenium? Is the website blocking me? Then why does half of the scraping work (for http://mmadecisions.com/decisions-by-event/2013/ and http://mmadecisions.com/decision/4801/John-Maguire-vs-Phil-Mulpeter)? As you can see in my output below, I print the href links before reaching the third layer:
import requests
from bs4 import BeautifulSoup
import time

my_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}

def ufc_spider(max_pages):
    page = 2013
    while page <= max_pages:
        url = 'http://mmadecisions.com/decisions-by-event/' + str(page) + '/'
        print(url)
        source_code = requests.get(url, headers=my_headers)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        data = soup.findAll('table', {'width': '100%'})[2]
        for link in data.findAll('a', href=True):
            href = 'http://mmadecisions.com/' + str(link.get('href'))
            source_code = requests.get(href, "html.parser")
            plain_text = source_code.text
            soup2 = BeautifulSoup(plain_text, "html.parser")
            tmp = []
            other = soup2.findAll('table', {'width': '100%'})[1]
            for con in other.findAll('td', {'class': 'list2'}):
                CON = con.a
                ahref = 'http://mmadecisions.com/' + str(CON.get('href'))
                print(ahref)
                time.sleep(5)
                get_single_item_data(ahref)
        page += 1

def get_single_item_data(item_url):
    tmp = []
    source_code = requests.get(item_url, headers=my_headers)
    time.sleep(10)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    print(soup)

ufc_spider(2017)
Here is my output, where I'm able to get the site's URLs but it won't let me pull data from the second URL:
http://mmadecisions.com/decisions-by-event/2013/
http://mmadecisions.com/decision/4801/John-Maguire-vs-Phil-Mulpeter
<html><head><title>Apache Tomcat/7.0.68 (Ubuntu) - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /decision/4801/John-Maguire-vs-Phil-Mulpeter%0D%0A</h1><hr noshade="noshade" size="1"/><p><b>type</b> Status report</p><p><b>message</b> <u>/decision/4801/John-Maguire-vs-Phil-Mulpeter%0D%0A</u></p><p><b>description</b> <u>The requested resource is not available.</u></p><hr noshade="noshade" size="1"/><h3>Apache Tomcat/7.0.68 (Ubuntu)</h3></body></html>
http://mmadecisions.com/decision/4793/Amanda-English-vs-Slavka-Vitaly
<html><head><title>Apache Tomcat/7.0.68 (Ubuntu) - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - /decision/4793/Amanda-English-vs-Slavka-Vitaly%0D%0A</h1><hr noshade="noshade" size="1"/><p><b>type</b> Status report</p><p><b>message</b> <u>/decision/4793/Amanda-English-vs-Slavka-Vitaly%0D%0A</u></p><p><b>description</b> <u>The requested resource is not available.</u></p><hr noshade="noshade" size="1"/><h3>Apache Tomcat/7.0.68 (Ubuntu)</h3></body></html>
http://mmadecisions.com/decision/4792/Chris-Boujard-vs-Peter-Queally
......
I've tried changing the User-Agent header, I've tried adding time delays, and I've run my code over a VPN. None of it is working, and all of it gives the same output.

Please help!
import requests
from bs4 import BeautifulSoup

links = []

for item in range(2013, 2020):
    print(f"{'-'*30}Extracting Year# {item}{'-'*30}")
    r = requests.get(f"http://mmadecisions.com/decisions-by-event/{item}/")
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll('a', {'href': True}):
        item = item.get('href')
        if item.startswith('event'):
            print(f"http://mmadecisions.com/{item}")
            links.append(f"http://mmadecisions.com/{item}")

print("\nNow Fetching all urls inside Years..\n")

for item in links:
    r = requests.get(item)
    soup = BeautifulSoup(r.text, 'html.parser')
    for item in soup.findAll('a', {'href': True}):
        item = item.get('href')
        if item.startswith('decision/'):
            print(f"http://mmadecisions.com/{item}".strip())
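The `.strip()` on the final URL is what makes this version succeed where the original spider failed. Look at the failing paths in the 404 pages above: they all end in `%0D%0A`, which is a percent-encoded carriage return plus line feed. The hrefs scraped from this site carry a trailing CRLF, and `requests` faithfully encodes it into the URL. A quick check of the path taken from the error output:

```python
from urllib.parse import unquote

# The path reported by the 404 page, decoded: %0D%0A is "\r\n".
path = unquote("/decision/4801/John-Maguire-vs-Phil-Mulpeter%0D%0A")
print(repr(path))           # the trailing CRLF is visible in the repr

# Stripping the whitespace restores the path the server actually serves.
print(repr(path.strip()))   # '/decision/4801/John-Maguire-vs-Phil-Mulpeter'
```

So the site is not blocking you and no JavaScript or Selenium is involved; pasting the link by hand works because retyping it drops the invisible trailing newline.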
Run the code online: Click Here
Note that you can use the following:
for item in soup.findAll('td', {'class': 'list'}):
    for an in item.findAll('a'):
        print(an.get('href'))
and
for item in soup.findAll('td', {'class': 'list2'}):
    for an in item.findAll('a'):
        print(an.get('href').strip())