Scrapy spider returns no items data
My Scrapy script doesn't seem to follow links, and in the end fails to extract data from each of them (and pass some content on as Scrapy items).
I'm trying to scrape a lot of data from a news site. I managed to copy/write a spider that, as I assumed, should read links from a file (generated with another script), put them into the start_urls list, start following those links to extract some data, pass it on as items, and -- finally -- write each item's data to a separate file (that last part is actually a different question).
After running scrapy crawl PNS, the script goes through all the links in start_urls, but that's it -- it requests the links read from the start_urls list (I can see the "GET link" messages in bash), but doesn't seem to enter them to read further links to follow and extract data from.
import scrapy
import re
from ProjectName.items import ProjectNameArticle


class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']

    start_urls = []
    with open('start_urls.txt', 'r') as file:
        for line in file:
            start_urls.append(line.strip())

    def parse(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
Expected result:
- The script opens the start_urls.txt file, reads its lines (each line contains a single link) and puts these links into the start_urls list,
- For each opened link the spider extracts deeper links to be followed (that's about 50-200 links for each start_urls link),
- The followed links are the main target from which I want to extract specific data: article title, date, time, text.
- For now, never mind writing each scrapy item to a distinct .txt file.
Actual result:
- Running my spider triggers a GET for each start_urls link (it goes through around 150,000 of them), but it doesn't build a list of deeper links, nor does it enter them to extract any data.
Man, I have been coding with Python Scrapy for a long time and I hate using start_urls. You can simply use start_requests instead; it is very easy to read and also easy for beginners to learn:
import re

import scrapy
from scrapy import Request

from ProjectName.items import ProjectNameArticle


class ProjectNameSpider(scrapy.Spider):
    name = 'PNS'
    allowed_domains = ['www.project-domain.com']

    def start_requests(self):
        # read the seed URLs from the file and schedule a request for each one
        with open('start_urls.txt', 'r') as file:
            for line in file:
                yield Request(line.strip(), callback=self.my_callback_func)

    def my_callback_func(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()
        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        # strip <script>/<style> blocks, HTML comments and any remaining tags
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
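With this spider, running scrapy crawl PNS -o articles.jl lets Scrapy's built-in feed export collect every returned item into a single JSON-lines file; writing each article to its own .txt file would need a small item pipeline, which the question leaves aside for now.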
I also never use the Item class and find it unnecessary. You can simply use data_dic = {} instead of data_dic = ProjectNameArticle().
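A minimal sketch of that dict version, assuming the same selectors as above (the default='' guards are an addition, so an article missing a field yields an empty string instead of crashing on .strip()):

    # Drop-in replacement for parse_news above; relies on the `import re`
    # already at the top of the spider module.
    def parse_news(self, response):
        data_dic = {}
        # .get(default='') returns an empty string instead of None when a
        # selector matches nothing, so .strip() cannot raise AttributeError
        # (assumption: some articles may lack one of these fields).
        data_dic['article_date'] = response.css('div.article__date::text').get(default='').strip()
        data_dic['article_time'] = response.css('span.article__time::text').get(default='').strip()
        data_dic['article_title'] = response.css('h3.article__title::text').get(default='').strip()
        news_text = response.css('div.article__text').get(default='')
        # same regex as before: strips <script>/<style> blocks, HTML comments
        # and any remaining tags from the article body
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic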