Scrapy Pipeline SQL Syntax Error
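I have a spider that reads URLs from a MySQL database and uses them as start_urls, then scrapes any number of new links from each crawled page. I get a SQL syntax error when I set the pipeline to INSERT both the start_url and the newly scraped URL into a new table (TestResults), and also when I set it to UPDATE the existing table with the newly scraped URL, using the start_url as the WHERE criterion.
When I insert only one or the other, I get no error.
Here is spider.py: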
import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request
from youtubephase2.items import Youtubephase2Item

class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT resultURL FROM SearchResults;')
        rows = cursor.fetchall()
        for row in rows:
            if row:
                yield Request(row[0], self.parse, meta=dict(start_url=row[0]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['newurl'] = sel.xpath('@href').extract()
            item['start_url'] = response.meta['start_url']
            yield item
Here is pipeline.py, showing all three self.cursor.execute statements:
import MySQLdb
import MySQLdb.cursors
import hashlib
from scrapy import log
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
from youtubephase2.items import Youtubephase2Item

class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['newurl'], item['start_url']))
            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s""",(item['newurl'], item['start_url']))
            self.cursor.execute("""INSERT INTO TestResults (NewURL, StartURL) VALUES (%s, %s)""",(item['newurl'], item['start_url']))
            self.conn.commit()
        except MySQLdb.Error, e:
            log.msg("Error %d: %s" % (e.args[0], e.args[1]))
        return item
The first (topmost) execute statement produces this error:
2017-04-13 18:29:34 [scrapy.core.scraper] ERROR: Error processing {'newurl': [u'http://www.tagband.co.uk/'],
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/scraping/youtubephase2/youtubephase2/pipelines.py", line 18, in process_item
    self.cursor.execute("""UPDATE SearchResults SET AffiliateURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['affiliateurl'], item['start_url']))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 159, in execute
    query = query % db.literal(args)
TypeError: not enough arguments for format string
The middle execute statement produces this error:
2017-04-13 18:33:18 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') WHERE ResultURL = 'https://www.youtube.com/watch?v=UqguztfQPho'' at line 1
2017-04-13 18:33:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho>
{'newurl': [u'http://www.tagband.co.uk/'],
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}
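The last execute statement produces the same kind of error as the middle one, even though it uses INSERT into a new table. It looks as if an extra single quote is being added. The last statement does work when I insert only one of the two values.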
2017-04-13 18:36:40 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'https://www.youtube.com/watch?v=UqguztfQPho')' at line 1
2017-04-13 18:36:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho>
{'newurl': [u'http://www.tagband.co.uk/'],
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}
Sorry for the long post; I was trying to be thorough.
I figured it out. The problem was that I was passing a list, not a string, to the MySQL execute call in the pipeline.
I created a separate pipeline that runs before the MySQL pipeline, converts the list to a string with "".join(item['newurl']), and returns the item.
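For anyone who wants to see the shape of it, here is a minimal sketch of such a pre-processing pipeline. The class name FlattenNewUrlPipeline is made up for illustration, and a plain dict stands in for the Scrapy item so the snippet runs without Scrapy installed.

```python
class FlattenNewUrlPipeline(object):
    """Hypothetical pipeline: turns item['newurl'] from a list into a string."""

    def process_item(self, item, spider):
        value = item.get('newurl')
        # .extract() returns a list, so join it into one plain string
        if isinstance(value, (list, tuple)):
            item['newurl'] = ''.join(value)
        return item


# Usage with a plain dict standing in for the Scrapy item (an assumption)
item = {
    'newurl': ['http://www.tagband.co.uk/'],
    'start_url': 'https://www.youtube.com/watch?v=UqguztfQPho',
}
result = FlattenNewUrlPipeline().process_item(item, spider=None)
print(result['newurl'])
```

In a real project this class would need to be listed in ITEM_PIPELINES with a lower order number than the MySQL pipeline, since pipelines with lower numbers run first.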
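There may be a better way: changing the item['newurl'] = sel.xpath('@href').extract() line in spider.py to take the first item of the list, or otherwise convert it to text. But the extra pipeline worked for me.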
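On the spider side, Scrapy selector lists have an extract_first() method that returns the first match (or None) instead of a list, so swapping it in for .extract() is probably the simplest change. The sketch below only mimics that behaviour in plain Python so it runs without Scrapy. The helper name first_or_default is an assumption for illustration, not part of Scrapy.

```python
def first_or_default(values, default=None):
    """Return the first element of values, or default if it is empty."""
    for value in values:
        return value
    return default


# What .extract() returned for the item shown in the logs above
extracted = ['http://www.tagband.co.uk/']

newurl = first_or_default(extracted)  # a plain string, safe to bind to %s
missing = first_or_default([])        # None when the link has no href
print(newurl)
```

Note the difference from "".join(): joining concatenates every match into one string, while taking the first element discards the rest. With one @href per link, as here, both give the same result.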