Unable to extract data from a website via Scrapy, but it works with the XPath Helper extension
So I created a Scrapy spider to extract data from a website, for example
https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21
Here is my code:
import scrapy
from totoprintasp.items import Result


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3

    def parse(self, response):
        # print(response.body)
        items = []
        for each in response.xpath("/html/body/div/center/table/tbody"):
            item = Result()
            drawDate = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[1]/span/font/b/text()").extract()
            drawNo = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[2]/span/b/font/text()").extract()
            gameType = each.xpath(
                "tr[4]/td/span/font/text()").extract()
            firstPrize = each.xpath(
                "tr[5]/td/table[1]/tbody/tr[2]/td[1]/span/b/font/text()").extract()
            item['drawDate'] = drawDate
            item['drawNo'] = drawNo
            item['gameType'] = gameType
            item['firstPrize'] = firstPrize
            items.append(item)
            yield item
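(The Result item imported above is not shown in the question; presumably it defines one Field per key used in parse(), roughly like this in totoprintasp/items.py:)

# totoprintasp/items.py -- assumed definition, reconstructed from the field names used in the spider
import scrapy

class Result(scrapy.Item):
    drawDate = scrapy.Field()
    drawNo = scrapy.Field()
    gameType = scrapy.Field()
    firstPrize = scrapy.Field()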
It doesn't extract anything. I run it with the command

scrapy runspider totoprint.py

and have set

FEED_URI = 'results.json'
FEED_FORMAT = 'json'

in my settings.py file, so the results should be written to a JSON file.
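(On Scrapy 2.1 or newer the same feed export can also be configured with the FEEDS setting; a minimal sketch of the settings.py entry, equivalent to the two values above:)

# settings.py -- feed export via the newer FEEDS setting (Scrapy >= 2.1)
FEEDS = {
    'results.json': {
        'format': 'json',
        'overwrite': True,  # start with a fresh file on every run (Scrapy >= 2.4)
    },
}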
The strange thing is that nothing shows up and nothing gets extracted. I tried different variations, even changing .extract() to .get(). The XPath works exactly as written when I try it in the XPath Helper extension in my Chrome browser.
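For comparison, the raw HTML that Scrapy receives can be inspected in scrapy shell (a minimal sketch using the URL from above):

# In a terminal, open an interactive shell against the page from the question:
#   scrapy shell "https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21"
# Then, inside that shell:
response.xpath("/html/body/div/center/table/tbody").getall()
# If this returns [], the browser-generated path does not match the downloaded HTML --
# browsers often insert <tbody> into the DOM even when it is absent from the source,
# which is a common reason an XPath Helper query matches but the spider finds nothing.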
Would appreciate any help or suggestions.
I rewrote your script, but you will have to adapt it back to your own Item. The problem here is that you are looking for one tbody and one child of it, but there are many tbody elements on the page.

As far as I understand, you want gameType to be a list and the others to be strings. I get the following output:
|------------------|-----------------|----------------------------------------|------------|
| drawDate | drawNo | gameType | firstPrize |
|------------------|-----------------|----------------------------------------|------------|
| Date:30/05/2021 | DrawNo. 5291/21 | TOTO 4D,TOTO 4D ZODIAC,TOTO 5D,TOTO 6D | 4800 |
|------------------|-----------------|----------------------------------------|------------|
By the way, you don't have to write a for loop over your URLs: parse() is called once for each URL in start_urls. So here is the script:
import scrapy


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # the site's robots.txt disallows the crawl, so this rule has to be turned off
    }

    def parse(self, response):
        # both values share the class "dataDD", so they can be scraped together
        drawDate, drawNo = response.xpath('//*[@class="dataDD"]//text()').extract()
        gameType = response.xpath('//*[@class="tit4D"]//text()').extract()
        # according to your script you only want the first prize, hence the [1] in the XPath
        firstPrize = response.xpath('(//*[@class="dataResultA"])[1]//text()').get()
        yield {
            # the raw text contains \t, \n and \r characters; strip them with replace()
            'drawDate': drawDate.replace("\t", "").replace("\n", "").replace("\r", ""),
            'drawNo': drawNo.replace("\t", "").replace("\n", "").replace("\r", ""),
            'gameType': gameType,
            'firstPrize': firstPrize,
        }
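As a small optional variation, the chained .replace() calls can be collapsed into a single whitespace normalization (a sketch; the helper name clean is just illustrative):

def clean(text):
    # split() drops tabs, carriage returns, newlines and runs of spaces;
    # join() puts the remaining words back together with single spaces
    return " ".join(text.split())

# e.g. clean("\r\n\tDrawNo. 5291/21 ") returns "DrawNo. 5291/21"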
I think the script I wrote is what you want.