Scraping multiple pages with an unchanging URL using BeautifulSoup
I'm using Beautiful Soup to extract data from a non-English website. Right now my code only extracts the first ten results from a keyword search. The website is designed so that additional results are accessed via a "more" button (a bit like infinite scroll, except you have to keep clicking "more" to get the next set of results). When I click "more" the URL doesn't change, so I can't just iterate over a different URL each time.
I would really like help with two things:
- Modifying the code below so that I can get data from all of the pages, not just the first 10 results
- Inserting a timer function so the server doesn't block me
I'm adding a photo of the "more" button, since it isn't in English. It's the blue text at the end of the page.
import requests, csv, os
from bs4 import BeautifulSoup
from time import strftime, sleep

# make a GET request (requests.get("URL")) and store the response in a response object
responsePA = requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')

# read the content of the server's response
rawPagePA = responsePA.text
soupPA = BeautifulSoup(rawPagePA, 'html.parser')

# take a look
print(soupPA.prettify())

urlsPA = []  # empty list to store URLs
for item in soupPA.find_all("div", class_="customStoryCard9-m__story-data__2qgWb"):  # select all story-card divs
    aTag = item.find("a")  # extract the 'a' tag inside each card
    urlsPA.append(aTag.attrs["href"])
print(urlsPA)

# Below I'm getting the data from each of the urls and storing them in a list
PAlist = []
for link in urlsPA:
    specificpagePA = requests.get(link)  # make a GET request and store the response in an object
    rawAddPagePA = specificpagePA.text  # read the content of the server's response
    PASoup2 = BeautifulSoup(rawAddPagePA, 'html.parser')  # parse the response into an HTML tree
    PAcontent = PASoup2.find_all(class_=["story-element story-element-text", "time-social-share-wrapper storyPageMetaData-m__time-social-share-wrapper__2-RAX", "headline headline-type-9 story-headline bn-story-headline headline-m__headline__3vaq9 headline-m__headline-type-9__3gT8S", "contributor-name contributor-m__contributor-name__1-593"])
    #print(PAcontent)
    PAlist.append(PAcontent)
You don't actually need Selenium.
The button sends the following GET request:
https://www.prothomalo.com/api/v1/advanced-search?fields=headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards&offset=10&limit=6&q=ধর্ষণ
The important part is the "offset=10&limit=6" at the end; subsequent clicks of the button just increase that offset by 6.
Getting data from all of the pages won't work, because there seem to be a lot of them and I don't see a way to determine how many there are. So you're best off picking a number and requesting until you have that many links.
Since this request returns JSON, you're also better off just parsing that, rather than feeding the HTML to BeautifulSoup.
Have a look at this:
import requests
import json

s = requests.Session()

term = 'ধর্ষণ'
count = 20

# Make GET request
r = s.get(
    'https://www.prothomalo.com/api/v1/advanced-search',
    params={
        'offset': 0,
        'limit': count,
        'q': term
    }
)

# Read response text (a JSON file)
info = json.loads(r.text)

# Loop over items
urls = [item['url'] for item in info['items']]

print(urls)
This returns the following list:
['https://www.prothomalo.com/world/asia/পাকিস্তানে-সন্তানদের-সামনে-মাকে-ধর্ষণের-মামলায়-দুজনের-মৃত্যুদণ্ড', 'https://www.prothomalo.com/bangladesh/district/খাবার-দেওয়ার-কথা-বদলে-ধর্ষণ-অবসরপ্রাপ্ত-শিক্ষকের-বিরুদ্ধে-মামলা', 'https://www.prothomalo.com/bangladesh/district/জয়পুরহাটে-অপহরণ-ও-ধর্ষণ-মামলায়-যুবকের-যাবজ্জীবন-কারাদণ্ড', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-ধর্ষণ-মামলায়-যুবক-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/সুবর্ণচরে-এত-ধর্ষণ-কেন', 'https://www.prothomalo.com/bangladesh/district/১২-বছরের-ছেলেকে-ধর্ষণ-মামলায়-একজন-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/ভালো-পাত্রের-সঙ্গে-বিয়ে-দেওয়ার-কথা-বলে-কিশোরীকে-ধর্ষণ-গ্রেপ্তার-১', 'https://www.prothomalo.com/bangladesh/district/সখীপুরে-দুই-শিশুকে-ধর্ষণ-মামলার-আসামিকে-গ্রেপ্তারের-দাবিতে-মানববন্ধন', 'https://www.prothomalo.com/bangladesh/district/বগুড়ায়-ছাত্রী-ধর্ষণ-মামলায়-তুফান-সরকারের-জামিন-বাতিল', 'https://www.prothomalo.com/world/india/ধর্ষণ-নিয়ে-মন্তব্যের-জের-ভারতের-প্রধান-বিচারপতির-পদত্যাগ-দাবি', 'https://www.prothomalo.com/bangladesh/district/ফুলগাজীতে-ধর্ষণ-মামলায়-অভিযুক্ত-ইউপি-চেয়ারম্যান-বরখাস্ত', 'https://www.prothomalo.com/bangladesh/district/ধুনটে-ধর্ষণ-মামলায়-ছাত্রলীগ-নেতা-গ্রেপ্তার', 'https://www.prothomalo.com/bangladesh/district/নোয়াখালীতে-কিশোরীকে-ধর্ষণ-ভিডিও-ধারণ-ও-অপহরণের-অভিযোগে-গ্রেপ্তার-২', 'https://www.prothomalo.com/bangladesh/district/বাবার-সঙ্গে-দেখা-করানোর-কথা-বলে-স্কুলছাত্রীকে-ধর্ষণ', 'https://www.prothomalo.com/opinion/column/ধর্ষণ-ঠেকাতে-প্রযুক্তির-ব্যবহার', 'https://www.prothomalo.com/world/asia/পার্লামেন্টের-মধ্যে-ধর্ষণ-প্রধানমন্ত্রীর-ক্ষমা-প্রার্থনা', 'https://www.prothomalo.com/bangladesh/district/তাবিজ-দেওয়ার-কথা-বলে-গৃহবধূকে-ধর্ষণ-কবিরাজ-আটক', 'https://www.prothomalo.com/bangladesh/district/আদালত-প্রাঙ্গণে-বিয়ে-করে-জামিন-পেলেন-ধর্ষণ-মামলার-আসামি', 'https://www.prothomalo.com/bangladesh/district/কিশোরীকে-দল-বেঁধে-ধর্ষণ-ও-ভিডিও-ধারণ-গ্রেপ্তার-৩', 
'https://www.prothomalo.com/bangladesh/district/ধর্ষণ-মামলায়-সহকারী-স্টেশনমাস্টার-গ্রেপ্তার']
By adjusting count you can set the number of URLs (articles) to retrieve; term is the search term.
The requests.Session object is used to have consistent cookies.
If you have any questions, feel free to ask.
Edit:
Just in case you're wondering how I found out which GET request was being sent when the button is clicked: I went to the Network Analysis tab of my browser's (Firefox) developer tools, clicked the button, observed which requests were being sent, and copied the URL.
One more explanation, about the params argument of the .get function: it contains (in Python dictionary format) all the parameters that are normally appended to the URL after the question mark. So
requests.get('https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3')
can be written as
requests.get('https://www.prothomalo.com/search', params={'q': 'ধর্ষণ'})
which makes it look a lot nicer and lets you actually see what you're searching for, since it's written in Unicode instead of already being URL-encoded.
Edit:
If the script starts returning an empty JSON file, and therefore no URLs, you probably have to set a User-Agent like this (I used one for Firefox, but any browser should be fine):
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
                  'Gecko/20100101 Firefox/87.0'
})
Just put that code below the line that initializes the session object (the s = ... line).
The User-Agent tells the website what kind of program is accessing its data.
Always keep in mind that the server has other things to do as well, and that the webpage has other priorities than sending thousands of search results to one person, so try to keep the traffic low. Scraping 5000 URLs is a lot, and if you really do have to do it multiple times, put a sleep(...) of at least a few seconds anywhere before making the next request (not just to keep yourself from being blocked, but to be nice to the people serving you the information you asked for).
Where you put the sleep doesn't really matter, since the only time you're actually in contact with the server is the s.get(...) line.
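Putting both of your requests together (stepping through the offset parameter and sleeping between requests), a possible sketch looks like the loop below. The function name, the page size of 10, and the two-second pause are my assumptions, not something the API documents:

```python
import time

def fetch_urls(term, want, page_size=10, pause=2.0, fetch=None):
    """Collect up to `want` article URLs by stepping the offset parameter.

    `fetch` can be overridden (e.g. with a stub for testing); by default it
    queries the advanced-search endpoint shown above.
    """
    if fetch is None:
        import requests  # only needed for the real network fetch
        s = requests.Session()

        def fetch(offset, limit):
            r = s.get(
                'https://www.prothomalo.com/api/v1/advanced-search',
                params={'offset': offset, 'limit': limit, 'q': term},
            )
            return r.json().get('items', [])

    urls, offset = [], 0
    while len(urls) < want:
        items = fetch(offset, page_size)
        if not items:  # no more results available
            break
        urls.extend(item['url'] for item in items)
        offset += page_size
        time.sleep(pause)  # be nice to the server before the next request
    return urls[:want]
```

A real call would then be something like fetch_urls('ধর্ষণ', 50); the loop stops early if the API runs out of results before reaching the requested count.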
This is where you add Selenium alongside bs4: to perform the clicks that make the site load more results, and then fetch the page content.
You can download geckodriver from this link.
Sample code would look like this:
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep

url = "https://www.prothomalo.com/search?q=%E0%A6%A7%E0%A6%B0%E0%A7%8D%E0%A6%B7%E0%A6%A3"
driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)

# Iterate over this with a loop for however many times you want to click "more";
# if the data takes time to load, the sleep() waits for the page to catch up
for _ in range(number_of_clicks):
    driver.find_element_by_css_selector('{class-name}').click()
    sleep(2)

# Then you just get the page content
soup = BeautifulSoup(driver.page_source, 'html.parser')

# now you have the content loaded with BeautifulSoup and can manipulate it as you were doing previously
{YOUR CODE}