How can I combine a simple project with a scrapy project?
I have an example scrapy project. It is pretty much the default. Its folder structure:
craiglist_sample/
├── craiglist_sample
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── test.py
└── scrapy.cfg
When you type
scrapy crawl craigs -o items.csv -t csv
into the Windows command prompt, it writes the craigslist items and links to the console.
I want to create an example.py in the main folder and have it print those items to the Python console.
I tried
from scrapy import cmdline
cmdline.execute("scrapy crawl craigs".split())
but it writes the same output as the Windows shell. How can I make it print only the items and links?
test.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craiglist_sample.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    ## allowed_domains = ["sfbay.craigslist.org"]
    ## start_urls = ["http://sfbay.craigslist.org/npo/"]
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.tr.craigslist.org/search/npo?"]
    ## search\/npo\?s=
    rules = (
        Rule(SgmlLinkExtractor(allow=('s=\d00',), restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
        ## titles = hxs.select("//p[@class='row']")
        items = []
        for titles in titles:
            item = CraiglistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items
One way is to turn off scrapy's default shell output and insert print commands in the parse_items function.
1 - Turn off the debug output in the file settings.py:
LOG_ENABLED = False
Documentation about log levels in Scrapy: http://doc.scrapy.org/en/latest/topics/logging.html
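For reference, this is roughly what craiglist_sample/settings.py would look like with that line added (a sketch; the other values are whatever scrapy startproject generated for you):

# craiglist_sample/settings.py (generated values kept as-is)
BOT_NAME = 'craiglist_sample'

SPIDER_MODULES = ['craiglist_sample.spiders']
NEWSPIDER_MODULE = 'craiglist_sample.spiders'

# Silence Scrapy's own logging so only your print statements reach the console
LOG_ENABLED = False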
2 - Add print commands for the items you are interested in:
for titles in titles:
    item = CraiglistSampleItem()
    item["title"] = titles.select("a/text()").extract()
    item["link"] = titles.select("a/@href").extract()
    items.append(item)
    print item["title"], item["link"]
The shell output will be:
[u'EXECUTIVE ASSISTANT'] [u'/eby/npo/4848086929.html']
[u'Direct Support Professional'] [u'/eby/npo/4848043371.html']
[u'Vocational Counselor'] [u'/eby/npo/4848042572.html']
[u'Day Program Supervisor'] [u'/eby/npo/4848041846.html']
[u'Educational Specialist'] [u'/eby/npo/4848040348.html']
[u'ORGANIZE WITH GREENPEACE - Grassroots Nonprofit Job!']
[u'/eby/npo/4847984654.html']
EDIT: code executed from a script
import os
os.system('scrapy crawl craigs > log.txt')
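If you want the output back inside your Python script instead of in a log file, a minimal sketch using the standard subprocess module (this assumes scrapy is on your PATH, just as os.system does):

import subprocess

# Run the spider and capture everything it writes to stdout/stderr
output = subprocess.check_output(
    ['scrapy', 'crawl', 'craigs'],
    stderr=subprocess.STDOUT,
)
print(output.decode('utf-8'))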
There are several other ways to run command-line programs from within Python.
Check Executing command line programs from within python and Calling an external command in Python.
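Alternatively, you can run the spider in the same Python process and skip the shell entirely. A minimal sketch of example.py, assuming a reasonably recent Scrapy where scrapy.crawler.CrawlerProcess is available (the API names differ in very old versions):

# example.py - runs the spider in-process (sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from craiglist_sample.spiders.test import MySpider

# get_project_settings() picks up settings.py, including LOG_ENABLED = False
process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished; parse_items prints the items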