Python Requests returns content of other random URL

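I'm seeing strange behavior when I scrape a web page with the Python requests library. For some reason I don't understand, when I fetch the page's content I get data from a different, apparently random page. Here is an example: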

import requests
from bs4 import BeautifulSoup

def scrape_webpage(url):
    """
    Function to scrape some data from given url
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    data = {'event_title': soup.find('h1').text.lower()}
    data['event_date'] = soup.find('li', {'class': 'header'}).text.split()[1]

    return data

# Test URL 
url = 'https://www.tapology.com/fightcenter/events/67412-ufc-on-espn-33'

# First try returns the correct info
first = scrape_webpage(url)
print(first)   
# {'event_date': '05.16.2020', 'event_title': 'ufc fight night: overeem vs. harris'}

# A second try changing nothing returns wrong info
second = scrape_webpage(url)
print(second)
# {'event_date': '06.20.2020', 'event_title': 'efm 3'}

# A third try also fails to retrieve the correct data
third = scrape_webpage(url)
print(third)
# {'event_date': '10.05.2010', 'event_title': 'bystriy fight club 1'}

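This behavior repeats with no apparent logic. It may also be worth mentioning that I'm running this in Google Colab. If I try to scrape a list of URLs, only the first one returns the correct data (and only on the first attempt); the rest return data from random URLs. So the question is: how can I fix this?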

You should mimic a real browser. At a minimum, this can be done by sending a browser User-Agent header:

def scrape_webpage(url):
    """
    Function to scrape some data from given url
    """
    # Send a browser-like User-Agent so the request looks like it comes from a real browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4103.61 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    data = {'event_title': soup.find('h1').text.lower()}
    data['event_date'] = soup.find('li', {'class': 'header'}).text.split()[1]

    return data

# Test URL 
url = 'https://www.tapology.com/fightcenter/events/67412-ufc-on-espn-33'

for x in range(10):
    # Repeat the same request to check that the result is now consistent
    result = scrape_webpage(url)
    print(result)

Output:

{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
{'event_title': 'ufc fight night: overeem vs. harris', 'event_date': '05.16.2020'}
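
Since the question also mentions scraping a list of URLs, here is a minimal sketch of how the same User-Agent idea might be applied with a shared requests.Session. The names BROWSER_HEADERS, build_session, parse_event and scrape_all are hypothetical helpers introduced for illustration, and the timeout value is an arbitrary assumption. The sketch also assumes the page keeps the same h1 and li.header structure used above.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: the same browser-like User-Agent string used in the answer above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/81.0.4103.61 Safari/537.36"
    )
}


def build_session():
    """Return a Session that sends the browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def parse_event(html):
    """Extract the event title and date from the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    header = soup.find("li", {"class": "header"})
    if title is None or header is None:
        # Fail loudly instead of raising an obscure AttributeError on None
        raise ValueError("expected <h1> and <li class='header'> not found")
    return {
        "event_title": title.text.strip().lower(),
        "event_date": header.text.split()[1],
    }


def scrape_all(urls, timeout=10):
    """Scrape each URL with one shared session; raise on HTTP error statuses."""
    results = []
    with build_session() as session:
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
            results.append(parse_event(response.content))
    return results
```

Calling scrape_all([url]) with the test URL above should return a one-element list of dictionaries in the same shape as the output shown. Keeping the parsing in its own function makes it possible to check the extraction logic against saved HTML without hitting the site.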