Setting up Django with Scrapy
This guide shows how to use Django, the most popular Python web framework, together with Scrapy, the most popular Python scraping framework. Both frameworks are awesome, and each works very well on its own.
Before you continue reading, make sure you are already beyond the “Getting Started” stage for both frameworks.
By the end of the guide, you will be able to:
Run Scrapy and automatically save the crawled items through the Django ORM.

1) Scrapy’s settings.py

{{{#!highlight python
def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    # Locate and load the Django settings module found in `path`
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

    # Add django project to sys.path
    import sys
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_env('/path/to/django/myproject/myproject/')
}}}

2) Scrapy’s items.py

{{{#!highlight python
from scrapy.contrib_exp.djangoitem import DjangoItem
from myapp.models import Poll

class PollItem(DjangoItem):
    django_model = Poll
}}}

3) Scrapy’s pipelines.py

{{{#!highlight python
from myapp.models import Poll

class PollPipeline(object):
    def process_item(self, item, spider):
        # DjangoItem.save() stores the item through its Django model
        item.save()
        return item
}}}

Done!
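One step the three files above leave implicit is enabling the pipeline: Scrapy only calls `PollPipeline` if it is listed in `ITEM_PIPELINES` in Scrapy’s settings.py. A minimal sketch, assuming the Scrapy project package is called `myscraper` (a hypothetical name, not taken from this guide). Older Scrapy releases accept a list, while newer releases expect a dict that maps each pipeline path to an order number:

```python
# Scrapy's settings.py ('myscraper' is a hypothetical project package name).

# Older Scrapy releases used the list form:
# ITEM_PIPELINES = ['myscraper.pipelines.PollPipeline']

# Newer Scrapy releases use the dict form; lower numbers run first,
# and 300 is an arbitrary choice here.
ITEM_PIPELINES = {'myscraper.pipelines.PollPipeline': 300}
```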
That’s all it takes to run Scrapy and automatically save the items through the Django ORM. You can now run your spider as usual:

{{{#!highlight bash
scrapy crawl myspider
}}}

PS: This guide aims to be complete. It builds on a popular Stack Overflow answer and completes the picture for Django 1.4, which adopted a new project layout. It also provides the code for the experimental DjangoItem (rare!).
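The settings.py snippet above targets Django 1.4. `setup_environ` was removed in Django 1.6 and the `imp` module was removed in Python 3.12, so on newer versions the usual replacement is to set `DJANGO_SETTINGS_MODULE` and call `django.setup()` (available since Django 1.7). A minimal sketch under that assumption; the `settings_module` default, the `configure` flag and the helper `project_root_for` are illustrative additions, not part of the original guide:

```python
import os
import sys


def project_root_for(settings_dir):
    """Return the directory that contains the Django settings package."""
    return os.path.abspath(os.path.join(settings_dir, os.path.pardir))


def setup_django_env(settings_dir, settings_module="myproject.settings", configure=True):
    """Make the Django project importable and point Django at its settings.

    `settings_module` is an assumed dotted path; adjust it to your project.
    """
    root = project_root_for(settings_dir)
    if root not in sys.path:
        sys.path.append(root)

    # Does not override a value that is already set in the environment
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)

    if configure:
        # Imported lazily so the path handling works even without Django installed
        import django
        django.setup()
    return root
```

It would be called near the top of Scrapy’s settings.py, for example `setup_django_env('/path/to/django/myproject/myproject/')`.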
== Another approach ==
{{{#!highlight python
# django-admin.py startproject djangoapp
# Create an app for your Django model: python manage.py startapp website
# Edit Scrapy's settings.py with a function that points to the Django environment
# Create a pipeline that accesses Django using the model.save() method

# ---- Scrapy settings.py ----
import os

ITEM_PIPELINES = ['myapp.pipelines.DjangoPipeline']

# See http://stackoverflow.com/questions/4271975/access-django-models-inside-of-scrapy
def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

current_dir = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
# Note: 'djangoapp' here is not the project directory; it should be
# '/djangoproject/djangoapp/'. To be precise, it is the directory that
# contains Django's settings.py.
setup_django_env(os.path.join(current_dir, '../djangoapp/'))

# ---- Scrapy pipelines.py ----
import datetime

from django.db.utils import IntegrityError
from scrapy.exceptions import DropItem

from djangoapp.websites.models import Website

class DjangoPipeline(object):
    def process_item(self, item, spider):
        website = Website(link=item['link'][0],
                          created=datetime.datetime.now(),
                          )
        try:
            website.save()
        except IntegrityError:
            # `link` is unique, so saving a duplicate raises IntegrityError
            raise DropItem("Contains duplicate domain: %s" % item['link'][0])
        return item

# ---- djangoapp model ----
from django.db import models

class Website(models.Model):
    link = models.CharField(max_length=200, unique=True)
    created = models.DateTimeField('date created')

    def __unicode__(self):
        return u"%s" % self.link

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: redtricycle
# date  : Nov 27, 2011
}}}
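The relative path passed to `setup_django_env` in the snippet above is easy to get wrong, so it can help to check how it resolves before wiring it into Scrapy. A minimal sketch using only the standard library; `settings_dir_from` and the `/work/...` paths are hypothetical names chosen for illustration:

```python
import os


def settings_dir_from(scrapy_settings_file, relative='../djangoapp/'):
    """Resolve the Django settings directory the same way the snippet does."""
    # Two dirname() calls: settings.py -> Scrapy package dir -> Scrapy project root
    current_dir = os.path.abspath(
        os.path.dirname(os.path.dirname(scrapy_settings_file)))
    # normpath collapses the '..' so the result is easy to inspect
    return os.path.normpath(os.path.join(current_dir, relative))


# Django's settings.py sits in a directory next to the Scrapy project root
print(settings_dir_from('/work/scraper/scraper/settings.py'))

# Django's settings.py sits one level deeper, inside the project directory
print(settings_dir_from('/work/scraper/scraper/settings.py',
                        relative='../djangoproject/djangoapp/'))
```

On a POSIX system the first call resolves to `/work/djangoapp` and the second to `/work/djangoproject/djangoapp`. Whichever form you use, the resulting directory has to be the one that contains Django’s settings.py.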