Unable to extract data from website via scrapy but works with xpath helper extension

So I created a scrapy spider to extract data from a website, for example https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21

Here is my code:

import scrapy
from totoprintasp.items import Result


def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]


class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my/result_print.asp']
    start_urls = generate_start_urls()
    download_delay = 3

    def parse(self, response):
        # print(response.body)

        items = []
        # print(response.body)
        for each in response.xpath("/html/body/div/center/table/tbody"):
            item = Result()
            drawDate = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[1]/span/font/b/text()").extract() 
            drawNo = each.xpath(
                "tr[2]/td/div/table/tbody/tr/td[2]/span/b/font/text()").extract()
            gameType = each.xpath(
                "tr[4]/td/span/font/text()").extract()
            firstPrize = each.xpath(
                "tr[5]/td/table[1]/tbody/tr[2]/td[1]/span/b/font/text()").extract()

            item['drawDate'] = drawDate
            item['drawNo'] = drawNo
            item['gameType'] = gameType
            item['firstPrize'] = firstPrize
            items.append(item)
            yield item

It does not extract anything. I run the command scrapy runspider totoprint.py and have set the values

FEED_URI = 'results.json'

FEED_FORMAT = 'json'

in my settings.py file, so the results should be written to a JSON file.
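
For reference, newer Scrapy releases (2.1 and later) replace these two keys with a single FEEDS setting; a minimal sketch of the equivalent settings.py entry, assuming the same output file, would be:

FEEDS = {
    'results.json': {
        'format': 'json',
    },
}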

The funny thing is that nothing shows up and nothing gets extracted. I tried different variations and even changed .extract() to .get().

The XPaths work fine when I try them in the XPath Helper extension in my Chrome browser.

Would appreciate some help or advice.

I rewrote your script, but you will have to adapt it back to your own item. The problem here is that you are looking for a single tbody and one of its children, but there are many tbody elements, and browsers often insert tbody nodes into the DOM that are not present in the raw HTML Scrapy downloads, which is why the paths work in XPath Helper but not in your spider.
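
A quick way to check what Scrapy actually receives (rather than the DOM your browser builds) is to try the XPaths in scrapy shell; a minimal sketch, assuming the same URL and that robots.txt still needs to be bypassed:

scrapy shell -s ROBOTSTXT_OBEY=False "https://www.sportstoto.com.my/result_print.asp?drawNo=5291/21"

>>> response.xpath("/html/body/div/center/table/tbody")        # the tbody-based path will most likely return []
>>> response.xpath('//*[@class="dataDD"]//text()').getall()    # the class-based path used below does match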

As far as I understand, you want gameType as a list and the others as strings. I get the following output:

|------------------|-----------------|----------------------------------------|------------|
| drawDate         | drawNo          | gameType                               | firstPrize |
|------------------|-----------------|----------------------------------------|------------|
| Date:30/05/2021  | DrawNo. 5291/21 | TOTO 4D,TOTO 4D ZODIAC,TOTO 5D,TOTO 6D | 4800       |
|------------------|-----------------|----------------------------------------|------------|

By the way, you do not have to loop over the URLs yourself; parse is called once for each URL in start_urls. So here is the script:

import scrapy

def generate_start_urls():
    drawNums = ['5291/21']
    return ['https://www.sportstoto.com.my/result_print.asp?drawNo={}'.format(drawNum) for drawNum in drawNums]

class TotoprintSpider(scrapy.Spider):
    name = 'totoprint'
    allowed_domains = ['www.sportstoto.com.my']  # allowed_domains takes bare domain names, not URL paths
    start_urls = generate_start_urls()
    download_delay = 3
    custom_settings = {
        "ROBOTSTXT_OBEY": False,  # the site's robots.txt blocks the crawler, so robots.txt obedience has to be turned off
    }

    def parse(self, response):
        drawDate, drawNo = response.xpath('//*[@class="dataDD"]//text()').extract()  # both values share the dataDD class, so they can be scraped together
        gameType = response.xpath('//*[@class="tit4D"]//text()').extract()
        firstPrize = response.xpath('(//*[@class="dataResultA"])[1]//text()').get()  # according to your script you only want the first prize, hence the [1] in the XPath
        yield {
            'drawDate': drawDate.replace("\t", "").replace("\n", "").replace("\r", ""),  # strip the stray \t, \n and \r characters
            'drawNo': drawNo.replace("\t", "").replace("\n", "").replace("\r", ""),
            'gameType': gameType,
            'firstPrize': firstPrize
        }
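
If you want to keep yielding your Result item instead of a plain dict (assuming it declares the same four fields you assign in your original spider), the end of parse could be adapted like this sketch:

from totoprintasp.items import Result

# ...inside parse(), after extracting drawDate, drawNo, gameType and firstPrize:
item = Result()
item['drawDate'] = drawDate.replace("\t", "").replace("\n", "").replace("\r", "")
item['drawNo'] = drawNo.replace("\t", "").replace("\n", "").replace("\r", "")
item['gameType'] = gameType
item['firstPrize'] = firstPrize
yield item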

I think the script I wrote is what you are looking for.