How can I crawl all of a table's data with Scrapy?
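I am new to Scrapy. I have just gone through a course, wrote the code, and more or less understand it.
The problem I am facing is that only the first row of the table is scraped.
Here is the code I tried.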
from ast import parse
from fileinput import filename
import scrapy


class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://publicholidays.com.bd/2022-dates/'
    ]

    def parse(self, response):
        for post in response.css('table'):
            yield {
                'date': post.css('td::text').getall()[0],
                'day': post.css('td::text').getall()[1],
                'event': post.css('tr td a::text').getall()[0]
            }
When I crawl with this, I get:
{"date": "21 Feb", "day": "Mon", "event": "Shaheed Day"}
How can I get all of the table's data?
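The CSS selection was the problem: the loop iterated over the table element, so it produced one item per table instead of one per row. Iterating over the rows fixes it. The code below works and can be run as is.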
import scrapy
from scrapy.crawler import CrawlerProcess


class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = ['https://publicholidays.com.bd/2022-dates']

    def parse(self, response):
        # Iterate over the table rows, not the table, to get one item per row.
        for post in response.css('.publicholidays tbody tr'):
            yield {
                'date': post.css('td:nth-child(1)::text').get(),
                'day': post.css('td:nth-child(2)::text').get(),
                # The event name is either a link or a plain <span>.
                'event': post.css('td:nth-child(3) a::text').get()
                         or post.css('td:nth-child(3) span::text').get()
            }


if __name__ == "__main__":
    # Run the spider as a plain script, without the "scrapy crawl" command.
    process = CrawlerProcess()
    process.crawl(PostsSpider)
    process.start()
Output:
{'date': '21 Feb', 'day': 'Mon', 'event': 'Shaheed Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '17 Mar', 'day': 'Thu', 'event': "Sheikh Mujibur Rahman's Birthday"}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '18 Mar', 'day': 'Fri', 'event': 'Shab e-Barat'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '26 Mar', 'day': 'Sat', 'event': 'Independence Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '14 Apr', 'day': 'Thu', 'event': 'Bengali New Year'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '28 Apr', 'day': 'Thu', 'event': 'Laylat al-Qadr'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '29 Apr', 'day': 'Fri', 'event': 'Jumatul Bidah'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '1 May', 'day': 'Sun', 'event': 'May Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '2 May', 'day': 'Mon', 'event': 'Eid ul-Fitr Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '3 May', 'day': 'Tue', 'event': 'Eid ul-Fitr'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '4 May', 'day': 'Wed', 'event': 'Eid ul-Fitr Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '16 May', 'day': 'Mon', 'event': 'Buddha Purnima'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Jul', 'day': 'Sat', 'event': 'Eid ul-Adha Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '\n', 'day': None, 'event': None}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '10 Jul', 'day': 'Sun', 'event': 'Eid ul-Adha'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '11 Jul', 'day': 'Mon', 'event': 'Eid ul-Adha Holiday'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Aug', 'day': 'Tue', 'event': 'Ashura'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '15 Aug', 'day': 'Mon', 'event': 'National Mourning Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '19 Aug', 'day': 'Fri', 'event': 'Shuba Janmashtami'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '5 Oct', 'day': 'Wed', 'event': 'Vijaya Dashami'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '9 Oct', 'day': 'Sun', 'event': 'Eid-e-Milad un-Nabi'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
{'date': '16 Dec', 'day': 'Fri', 'event': 'Victory Day'}
2022-04-01 16:30:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://publicholidays.com.bd/2022-dates/>
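The output above contains one item whose date is '\n' and whose day and event are None, which appears to come from a spacer row in the table. One possible way to drop such rows is sketched below. The helper names is_real_row and clean_items are hypothetical and not part of Scrapy, and the sample list simply reuses three items from the log above.

```python
def is_real_row(item):
    """Return True when the scraped row carries a non-blank date."""
    date = item.get("date")
    return bool(date and date.strip())


def clean_items(items):
    """Yield only populated rows, trimming whitespace from string values."""
    for item in items:
        if is_real_row(item):
            yield {
                key: value.strip() if isinstance(value, str) else value
                for key, value in item.items()
            }


# Three items copied from the log above, including the blank one.
scraped = [
    {"date": "9 Jul", "day": "Sat", "event": "Eid ul-Adha Holiday"},
    {"date": "\n", "day": None, "event": None},
    {"date": "10 Jul", "day": "Sun", "event": "Eid ul-Adha"},
]

cleaned = list(clean_items(scraped))
print(cleaned)
```

Inside the spider, the same check could be applied before the yield, skipping any row whose first cell is empty or whitespace only.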
import scrapy


class QuestionSpider(scrapy.Spider):
    name = 'question'
    allowed_domains = ['publicholidays.com.bd']
    start_urls = ['https://publicholidays.com.bd/2022-dates/']

    def parse(self, response):
        # Walk every row except the last one.
        for a in response.xpath("//table//tr")[:-1]:
            # Ignore spacer rows whose first cell contains only a newline.
            if a.xpath("./td[1]/text()").get() != '\n':
                # Build a fresh dict per row so items do not share state.
                item = {}
                item["date"] = a.xpath("./td[1]/text()").get()
                item["day"] = a.xpath("./td[2]/text()").get()
                # The holiday name is either a link or a plain <span>.
                if a.xpath(".//a/text()").get() is not None:
                    item["holiday"] = a.xpath(".//a/text()").get()
                else:
                    item["holiday"] = a.xpath(".//span/text()").get()
                yield item
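The difference between the spider in the question and the two answers is the loop target: looping over the table runs once, while looping over its rows runs once per holiday. The sketch below reproduces that offline. It is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the SAMPLE markup is a simplified, made-up two-row table, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page has more rows and attributes.
SAMPLE = """
<table class="publicholidays">
  <tbody>
    <tr><td>21 Feb</td><td>Mon</td><td><a href="#">Shaheed Day</a></td></tr>
    <tr><td>17 Mar</td><td>Thu</td><td><span>Sheikh Mujibur Rahman's Birthday</span></td></tr>
  </tbody>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# Looping over the table: the body runs once and picks the first cells only.
per_table = []
for node in [table]:
    cells = node.findall(".//td")
    per_table.append({"date": cells[0].text, "day": cells[1].text})

# Looping over the rows: the body runs once per <tr>.
per_row = []
for row in table.findall(".//tr"):
    cells = row.findall("td")
    link = cells[2].find("a")
    span = cells[2].find("span")
    # The event name is either inside a link or inside a plain span.
    event = link.text if link is not None else span.text
    per_row.append({"date": cells[0].text, "day": cells[1].text, "event": event})

print(len(per_table), len(per_row))
```

In Scrapy itself the same idea is what the row-level selectors in the answers express, for example '.publicholidays tbody tr' in CSS or "//table//tr" in XPath.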