正在解析包含 HTML 内容的 'Load More' 响应
Parsing an 'Load More' response with HTML content
我正在尝试抓取位于下方 link 的伊斯坦布尔省公告部分中的每个内容,该部分在页面底部加载带有 'Load More' 的内容。从开发工具/网络,我检查了发送的 POST 请求的属性并相应地更新了 header 。响应显然不是 json,而是 html 代码。
我想生成已解析的 html 响应,但是当我抓取它时,它什么也没 return 并且永远停留在第一个请求上。 提前谢谢你。
你能解释一下我的代码有什么问题吗?我在这里检查了数十个问题,但无法解决问题。据我了解,它只是无法解析响应 html 但我不明白为什么。
ps:我已经热情地进入 Python 和抓取了 20 天。原谅我的无知。
import scrapy
class DuyurularSpider(scrapy.Spider):
name = 'duyurular'
allowed_domains = ['istanbul.gov.tr']
start_urls = ['http://istanbul.gov.tr/duyurular']
headerz = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Connection" : "keep-alive",
"Content-Length": "112",
"Content-Type": "application/json",
"Cookie" : "_ga=GA1.3.285027250.1638576047; _gid=GA1.3.363882495.1639180128; ASP.NET_SessionId=ijw1mmc5xrpiw2iz32hmqb3a; NSC_ESNS=3e8876df-bcc4-11b4-9678-e2abf1d948a7_2815152435_0584317866_00000000013933875891; _gat_gtag_UA_136413027_31=1",
"Host": "istanbul.gov.tr",
"Origin": "http://istanbul.gov.tr",
"Referer": "http://istanbul.gov.tr/duyurular",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}
def parse(self, response):
url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
load_more = scrapy.Request(url, callback = self.parse_api, method = "POST", headers = self.headerz)
yield load_more
def parse_api(self, response):
raw_data = response.body
data = raw_data.xpath('//div[@class="ministry-announcements"]')
for bilgi in data:
gun = bilgi.xpath('//div[@class = "day"]/text()').extract_first() #day
ay = bilgi.xpath('//div[@class = "month"]/text()').extract_first() #month
metin = bilgi.xpath('//a[@class ="announce-text"]/text()').extract_first() #text
yield {'Ay:' : ay,
'Gün' : gun,
'Metin': metin,}
Result I encounter:
删除 Content-Length
,也永远不要将其包含在 headers 中。你也应该删除 cookie 并让 scrapy 处理它。
-
你需要知道什么时候停止,在本例中是一个空白页面。
在 bilgi.xpath
部分,您一遍又一遍地看到相同的行,因为您忘记了开头的一个点。
完整的工作代码:
import scrapy
import json
class DuyurularSpider(scrapy.Spider):
name = 'duyurular'
allowed_domains = ['istanbul.gov.tr']
start_urls = ['http://istanbul.gov.tr/duyurular']
page = 1
headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Type": "application/json",
"Host": "istanbul.gov.tr",
"Origin": "http://istanbul.gov.tr",
"Referer": "http://istanbul.gov.tr/duyurular",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}
def parse(self, response):
url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
body = {
"ContentCount": "8",
"ContentTypeId": "D6mHJdtwBYsvtS2xCvXiww==",
"GosterimSekli": "1",
"OrderByAsc": "true",
"page": f"{str(self.page)}"
}
if response.body.strip(): # check if we get an empty page
load_more = scrapy.Request(url, method="POST", headers=self.headers, body=json.dumps(body))
yield load_more
self.page += 1
data = response.xpath('//div[@class="ministry-announcements"]')
for bilgi in data:
gun = bilgi.xpath('.//div[@class = "day"]/text()').extract_first() #day
ay = bilgi.xpath('.//div[@class = "month"]/text()').extract_first() #month
metin = bilgi.xpath('.//a[@class ="announce-text"]/text()').extract_first() #text
yield {
'Ay:': ay,
'Gün': gun,
'Metin': metin.strip(),
}
我正在尝试抓取位于下方 link 的伊斯坦布尔省公告部分中的每个内容,该部分在页面底部加载带有 'Load More' 的内容。从开发工具/网络,我检查了发送的 POST 请求的属性并相应地更新了 header 。响应显然不是 json,而是 html 代码。
我想生成已解析的 html 响应,但是当我抓取它时,它什么也没 return 并且永远停留在第一个请求上。 提前谢谢你。
你能解释一下我的代码有什么问题吗?我在这里检查了数十个问题,但无法解决问题。据我了解,它只是无法解析响应 html 但我不明白为什么。
ps:我已经热情地进入 Python 和抓取了 20 天。原谅我的无知。
import scrapy
class DuyurularSpider(scrapy.Spider):
name = 'duyurular'
allowed_domains = ['istanbul.gov.tr']
start_urls = ['http://istanbul.gov.tr/duyurular']
headerz = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Connection" : "keep-alive",
"Content-Length": "112",
"Content-Type": "application/json",
"Cookie" : "_ga=GA1.3.285027250.1638576047; _gid=GA1.3.363882495.1639180128; ASP.NET_SessionId=ijw1mmc5xrpiw2iz32hmqb3a; NSC_ESNS=3e8876df-bcc4-11b4-9678-e2abf1d948a7_2815152435_0584317866_00000000013933875891; _gat_gtag_UA_136413027_31=1",
"Host": "istanbul.gov.tr",
"Origin": "http://istanbul.gov.tr",
"Referer": "http://istanbul.gov.tr/duyurular",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}
def parse(self, response):
url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
load_more = scrapy.Request(url, callback = self.parse_api, method = "POST", headers = self.headerz)
yield load_more
def parse_api(self, response):
raw_data = response.body
data = raw_data.xpath('//div[@class="ministry-announcements"]')
for bilgi in data:
gun = bilgi.xpath('//div[@class = "day"]/text()').extract_first() #day
ay = bilgi.xpath('//div[@class = "month"]/text()').extract_first() #month
metin = bilgi.xpath('//a[@class ="announce-text"]/text()').extract_first() #text
yield {'Ay:' : ay,
'Gün' : gun,
'Metin': metin,}
Result I encounter:
删除
Content-Length
,也永远不要将其包含在 headers 中。你也应该删除 cookie 并让 scrapy 处理它。你需要知道什么时候停止,在本例中是一个空白页面。
在
bilgi.xpath
部分,您一遍又一遍地看到相同的行,因为您忘记了开头的一个点。
完整的工作代码:
import scrapy
import json
class DuyurularSpider(scrapy.Spider):
name = 'duyurular'
allowed_domains = ['istanbul.gov.tr']
start_urls = ['http://istanbul.gov.tr/duyurular']
page = 1
headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Type": "application/json",
"Host": "istanbul.gov.tr",
"Origin": "http://istanbul.gov.tr",
"Referer": "http://istanbul.gov.tr/duyurular",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}
def parse(self, response):
url = 'http://istanbul.gov.tr/ISAYWebPart/Announcement/AnnouncementDahaFazlaYukle'
body = {
"ContentCount": "8",
"ContentTypeId": "D6mHJdtwBYsvtS2xCvXiww==",
"GosterimSekli": "1",
"OrderByAsc": "true",
"page": f"{str(self.page)}"
}
if response.body.strip(): # check if we get an empty page
load_more = scrapy.Request(url, method="POST", headers=self.headers, body=json.dumps(body))
yield load_more
self.page += 1
data = response.xpath('//div[@class="ministry-announcements"]')
for bilgi in data:
gun = bilgi.xpath('.//div[@class = "day"]/text()').extract_first() #day
ay = bilgi.xpath('.//div[@class = "month"]/text()').extract_first() #month
metin = bilgi.xpath('.//a[@class ="announce-text"]/text()').extract_first() #text
yield {
'Ay:': ay,
'Gün': gun,
'Metin': metin.strip(),
}