How can I scrape all the data from a table with Scrapy?

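I'm new to Scrapy. I just finished a course, wrote this code along the way, and more or less understand how it works. My problem is that the spider only scrapes the first entry of the table.

Here is the code I tried: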

from ast import parse
from fileinput import filename
import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"

    start_urls= [
        'https://publicholidays.com.bd/2022-dates/'
    ]
    
    def parse(self, response):
        for post in response.css('table'):
            yield{
                'date' : post.css('td::text').getall()[0],
                'day' : post.css('td::text' ).getall()[1],
                'event' : post.css('tr td a::text').getall()[0]
            }

When I run the crawl, this is all I get:

{"date": "21 Feb", "day": "Mon", "event": "Shaheed Day"}

How do I get all of the table's data?

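The problem was the CSS selection. Your loop iterates over table elements, so it yields at most one item per table, and getall()[0] only ever returns the first matching cell. Iterate over the table rows instead and pick each cell with td:nth-child(n). The version below works; you can run it directly as a script.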

import scrapy
from scrapy.crawler import CrawlerProcess


class PostsSpider(scrapy.Spider):
    name = "posts"

    start_urls = ['https://publicholidays.com.bd/2022-dates']

    def parse(self, response):
        # Loop over the table rows, so each row becomes one item.
        for post in response.css('.publicholidays tbody tr'):
            yield {
                'date': post.css('td:nth-child(1)::text').get(),
                'day': post.css('td:nth-child(2)::text').get(),
                # The holiday name is inside either a link or a span.
                'event': (post.css('td:nth-child(3) a::text').get()
                          or post.css('td:nth-child(3) span::text').get()),
            }


if __name__ == "__main__":
    # Run the spider as a plain Python script, without the scrapy CLI.
    process = CrawlerProcess()
    process.crawl(PostsSpider)
    process.start()

Output from running the spider:

{'date': '21 Feb', 'day': 'Mon', 'event': 'Shaheed Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '17 Mar', 'day': 'Thu', 'event': "Sheikh Mujibur Rahman's Birthday"}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '18 Mar', 'day': 'Fri', 'event': 'Shab e-Barat'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '26 Mar', 'day': 'Sat', 'event': 'Independence Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '14 Apr', 'day': 'Thu', 'event': 'Bengali New Year'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '28 Apr', 'day': 'Thu', 'event': 'Laylat al-Qadr'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '29 Apr', 'day': 'Fri', 'event': 'Jumatul Bidah'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '1 May', 'day': 'Sun', 'event': 'May Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '2 May', 'day': 'Mon', 'event': 'Eid ul-Fitr Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '3 May', 'day': 'Tue', 'event': 'Eid ul-Fitr'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '4 May', 'day': 'Wed', 'event': 'Eid ul-Fitr Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '16 May', 'day': 'Mon', 'event': 'Buddha Purnima'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Jul', 'day': 'Sat', 'event': 'Eid ul-Adha Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '\n', 'day': None, 'event': None}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '10 Jul', 'day': 'Sun', 'event': 'Eid ul-Adha'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '11 Jul', 'day': 'Mon', 'event': 'Eid ul-Adha Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Aug', 'day': 'Tue', 'event': 'Ashura'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '15 Aug', 'day': 'Mon', 'event': 'National Mourning Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '19 Aug', 'day': 'Fri', 'event': 'Shuba Janmashtami'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '5 Oct', 'day': 'Wed', 'event': 'Vijaya Dashami'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Oct', 'day': 'Sun', 'event': 'Eid-e-Milad un-Nabi'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '16 Dec', 'day': 'Fri', 'event': 'Victory Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
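
The output above contains one item whose date is a bare newline and whose other fields are None. If you don't want such rows, one option is a small filter like the sketch below. The helper name is_blank_row is hypothetical, and the sample items are copied from the output above.

```python
def is_blank_row(item):
    """Return True when every field is None or whitespace-only."""
    return all(value is None or not value.strip() for value in item.values())


# Three items taken from the spider output, including the blank one.
items = [
    {'date': '9 Jul', 'day': 'Sat', 'event': 'Eid ul-Adha Holiday'},
    {'date': '\n', 'day': None, 'event': None},
    {'date': '10 Jul', 'day': 'Sun', 'event': 'Eid ul-Adha'},
]

kept = [item for item in items if not is_blank_row(item)]
print(kept)
```

Inside parse, the same check could be applied to each item before yielding it.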

Alternatively, you can select the rows with XPath and skip the row whose first cell is only a newline:

import scrapy


class QuestionSpider(scrapy.Spider):
    name = 'question'
    allowed_domains = ['publicholidays.com.bd']
    start_urls = ['https://publicholidays.com.bd/2022-dates/']

    def parse(self, response):
        # [:-1] drops the last <tr> of the table.
        for row in response.xpath("//table//tr")[:-1]:
            date = row.xpath("./td[1]/text()").get()
            # Skip rows whose first cell contains only a newline.
            if date == '\n':
                continue
            # The holiday name is inside either a link or a span.
            holiday = row.xpath(".//a/text()").get()
            if holiday is None:
                holiday = row.xpath(".//span/text()").get()
            # Build a fresh dict per row and yield it so Scrapy collects it.
            yield {
                "date": date,
                "day": row.xpath("./td[2]/text()").get(),
                "holiday": holiday,
            }