How to bypass bot detection and scrape a website using Python

Question

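I'm new to web scraping, and I'm trying to create a scraper that looks at a playlist link and gets the list of songs and their artists.

But the site kept rejecting my connection because it thought I was a bot, so I used UserAgent to create a fake user agent string to try to get past the filter.

It sort of worked? But the problem is that when you visit the site in a browser you can see the contents of the playlist, whereas when you try to extract the HTML with requests, the playlist contents are just a big blank space.

Do I have to wait for the page to load? Or is there a stronger bot filter?

My code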

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Generate a random User-Agent string to send instead of the requests default
ua = UserAgent()

melon_site = "http://kko.to/IU8zwNmjM"

headers = {'User-Agent': ua.random}
result = requests.get(melon_site, headers=headers)

print(result.status_code)
# Parse the returned HTML and dump it for inspection
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

Website

Link

playlist link

HTML I get when using requests

html with blank space where the playlist was supposed to be

Answer 1

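POINTS TO REMEMBER WHILE SCRAPING

1) Use a good user agent. ua.random may return a user agent that the server blocks.

2) If you are doing a lot of scraping, limit your crawl rate with time.sleep() so that the server is not overloaded by requests from your IP address; otherwise it will block you.

3) If the server blocks you, try IP rotation.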
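Points 1 and 2 above can be sketched roughly as follows. This is only an illustration, not a tested recipe for any particular site: the names Throttle, fetch_all and HEADERS are hypothetical, the 2-second interval in the usage note is an arbitrary assumption, and the get argument stands in for requests.get so that the pacing logic can be read on its own.

```python
import time

# Assumption: one fixed, browser-like User-Agent instead of ua.random.
HEADERS = {"User-Agent": "Mozilla/5.0"}


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, get, throttle, headers):
    """Fetch each URL in turn, pausing between requests.

    `get` is any callable with the shape get(url, headers=...),
    for example requests.get.
    """
    pages = []
    for url in urls:
        throttle.wait()
        pages.append(get(url, headers=headers))
    return pages
```

With the real library this might be called as fetch_all(urls, requests.get, Throttle(2.0), HEADERS). For point 3, requests.get also accepts a proxies argument, so a rotating proxy could be passed in the same place.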

Answer 2

You want to look at this link to get the content you are after.

The following attempt should fetch you the artist names and their song names.

import requests
from bs4 import BeautifulSoup

# URL that returns the playlist's song list
url = 'https://www.melon.com/mymusic/playlist/mymusicplaylistview_listSong.htm?plylstSeq=473505374'

r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")
# Keep only the table rows that contain an element with id "artistName"
for item in soup.select("tr:has(#artistName)"):
    artist_name = item.select_one("#artistName > a[href*='goArtistDetail']")['title']
    song = item.select_one("a[href*='playSong']")['title']
    print(artist_name, song)

Output looks like:

Martin Garrix - 페이지 이동 Used To Love (feat. Dean Lewis) 재생 - 새 창
Post Malone - 페이지 이동 Circles 재생 - 새 창
Marshmello - 페이지 이동 Here With Me 재생 - 새 창
Coldplay - 페이지 이동 Cry Cry Cry 재생 - 새 창

Note: Your BeautifulSoup version should be 4.7.0 or later for the script to support pseudo-selectors.

python

beautifulsoup

web-scraping

python-requests

botdetect
