Saving spider results to a database
I'm currently thinking about a good way to save my scraped data to a database.

App flow:
- Run the spider (the data scraping tool); the files live in spiders/
- Once data has been collected successfully, save the data/items (title, link, pubDate) to the database using the class in pipeline.py

I need your help understanding how to save the scraped data (title, link, pubDate) from spider.py to the database through pipeline.py; at the moment I have nothing connecting these files together. When the data has been scraped successfully, the pipeline needs to be notified so it can receive the data and save it.

Thank you very much for your help!
Spider.py
import urllib.request
import lxml.etree as ET
opener = urllib.request.build_opener()
tree = ET.parse(opener.open('https://nordfront.se/feed'))
items = [{'title': item.find('title').text, 'link': item.find('link').text, 'pubdate': item.find('pubDate').text} for item in tree.xpath("/rss/channel/item")]
for item in items:
    print(item['title'], item['link'], item['pubdate'])
Models.py
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
from sqlalchemy import UniqueConstraint
import datetime
import settings
def db_connect():
    """
    Performs database connection using database settings from settings.py.
    Returns sqlalchemy engine instance.
    """
    return create_engine(URL(**settings.DATABASE))
DeclarativeBase = declarative_base()
# <--snip-->
def create_presstv_table(engine):
    DeclarativeBase.metadata.create_all(engine)

def create_nordfront_table(engine):
    DeclarativeBase.metadata.create_all(engine)

def _get_date():
    return datetime.datetime.now()
class Nordfront(DeclarativeBase):
    """Sqlalchemy deals model"""
    __tablename__ = "nordfront"

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    description = Column('description', String, nullable=True)
    link = Column('link', String, unique=True)
    date = Column('date', String, nullable=True)
    created_at = Column('created_at', DateTime, default=_get_date)
Pipeline.py
from sqlalchemy.orm import sessionmaker
from models import Nordfront, db_connect, create_nordfront_table
class NordfrontPipeline(object):
    """Pipeline for storing scraped items in the database"""

    def __init__(self):
        """
        Initializes database connection and sessionmaker.
        Creates deals table.
        """
        engine = db_connect()
        create_nordfront_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """Save data in the database.

        This method is called for every item pipeline component.
        """
        session = self.Session()
        deal = Nordfront(**item)

        if session.query(Nordfront).filter_by(link=item['link']).first() is None:
            try:
                session.add(deal)
                session.commit()
            except:
                session.rollback()
                raise
            finally:
                session.close()

        return item
Settings.py
DATABASE = {'drivername': 'postgres',
            'host': 'localhost',
            'port': '5432',
            'username': 'toothfairy',
            'password': 'password123',
            'database': 'news'}
As far as I understand, this is a Scrapy-specific question. If so, you just need to activate your pipeline in settings.py:
ITEM_PIPELINES = {
    'myproj.pipeline.NordfrontPipeline': 100
}
This would let the engine know to send the scraped items through the pipeline (see the Scrapy control flow). The integer value assigned to the pipeline determines the order in which multiple pipelines run; lower values run first.
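For context, here is a minimal sketch of what the spider could look like as an actual Scrapy spider, so the engine has items to route through the pipeline. The NordfrontSpider class name and the myproj module path above are assumptions, not taken from your code, and the yielded keys are chosen to match the Nordfront model columns (title, link, date):

import scrapy

class NordfrontSpider(scrapy.Spider):
    # Hypothetical spider; adjust the name and module to your project.
    name = 'nordfront'
    start_urls = ['https://nordfront.se/feed']

    def parse(self, response):
        # Each dict yielded here is handed to every activated
        # pipeline's process_item() by the Scrapy engine.
        for node in response.xpath('//channel/item'):
            yield {
                'title': node.xpath('title/text()').get(),
                'link': node.xpath('link/text()').get(),
                'date': node.xpath('pubDate/text()').get(),
            }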
If we are not talking about Scrapy, then call process_item() directly from your spider:
from pipeline import NordfrontPipeline

...

pipeline = NordfrontPipeline()
for item in items:
    pipeline.process_item(item, None)
You may also remove the spider argument from the process_item() pipeline method, since it is not used.
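One caveat either way: Nordfront(**item) unpacks the item dict into keyword arguments, and the model defines title, link and date columns, while your spider builds its dicts with a pubdate key, so passing them through unchanged would raise a TypeError. A minimal sketch of the direct-call variant with the key renamed first:

from pipeline import NordfrontPipeline

pipeline = NordfrontPipeline()
for item in items:
    # The Nordfront model has a 'date' column, not 'pubdate';
    # rename the key so Nordfront(**item) accepts it.
    item['date'] = item.pop('pubdate')
    pipeline.process_item(item, None)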