Scrapy

HukDog · Published 2018-08-01 23:10:29

Introduction to Scrapy

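Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It lets you extract the data you need from websites in a fast, simple, and extensible way. Today Scrapy is used far more broadly: for data mining, monitoring, and automated testing, for processing data returned by APIs (such as Amazon Associates Web Services), and as a general-purpose web crawler.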

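Scrapy is built on Twisted, a popular event-driven networking framework for Python. As a result, Scrapy uses non-blocking (that is, asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.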

The data flow in Scrapy is controlled by the core execution engine and proceeds as follows:

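1. The engine opens a website (opens a domain), finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.

2. The engine gets the first URL(s) to crawl from the Spider and schedules them as Requests in the Scheduler.

3. The engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to crawl to the engine, which forwards it to the Downloader through the downloader middleware (request direction).

5. Once the page has been downloaded, the Downloader generates a Response for it and sends it to the engine through the downloader middleware (response direction).

6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middleware (input direction).

7. The Spider processes the Response and returns the scraped Items and any new (follow-up) Requests to the engine.

8. The engine sends the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.

9. The process repeats (from step 2) until there are no more requests in the Scheduler, at which point the engine closes the website.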
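The loop described in these steps can be sketched in plain Python. This is a simplified, synchronous toy model, not Scrapy's actual implementation: the function names (`toy_download`, `toy_spider_parse`, `run_engine`), the in-memory `site` dictionary, and the `seen` set used to skip duplicate URLs are all assumptions made for illustration. Real Scrapy runs these steps asynchronously on top of Twisted.

```python
from collections import deque


def toy_download(url, site):
    """Stand-in for the Downloader: look the page up in an in-memory site."""
    return {"url": url, "links": site.get(url, [])}


def toy_spider_parse(response):
    """Stand-in for a Spider callback: yield one item, then follow-up requests."""
    yield {"item": response["url"]}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_urls, site):
    scheduler = deque(start_urls)  # steps 1-2: the first requests are scheduled
    seen = set(start_urls)         # simplification: never schedule a URL twice
    pipeline_output = []
    while scheduler:               # step 9: repeat until no requests are left
        url = scheduler.popleft()                  # steps 3-4: next request
        response = toy_download(url, site)         # step 5: download the page
        for result in toy_spider_parse(response):  # steps 6-7: spider parses
            if "item" in result:
                pipeline_output.append(result["item"])  # step 8: item pipeline
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])     # step 8: scheduler
    return pipeline_output


# A hypothetical three-page site: "/" links to "/a" and "/b", "/a" links to "/b".
site = {"/": ["/a", "/b"], "/a": ["/b"], "/b": []}
print(run_engine(["/"], site))  # ['/', '/a', '/b']
```

The engine here is just the `while` loop: it owns no data of its own and only moves requests, responses, and items between the other parts, which mirrors the role described above.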

Scrapy consists mainly of the following components:

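1. Engine: controls the data flow between all the components; when an action triggers an event, the event is handled through the engine.

2. Scheduler: receives requests from the engine, places them in a queue, and hands them back to the engine when the engine asks for them.

3. Downloader: downloads web data at the engine's request and returns the resulting response to the engine.

4. Spider: issues requests and processes the downloader responses that the engine passes to it, then returns items and follow-up requests (URLs matching its rules) to the engine.

5. Item pipeline: processes the data that the engine passes on after the spider has parsed it, and persists that data, for example by storing it in a database or a file.

6. Downloader middleware: the component that sits between the engine and the downloader. It takes the form of hooks (plugins) that can intercept requests, handle the download themselves, and return the resulting response to the engine.

7. Spider middleware: the component that sits between the engine and the spider. It takes the form of hooks (plugins) that can process responses as well as the items and new requests returned to the engine.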
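To make the item pipeline component more concrete, here is a minimal sketch of a pipeline that persists items to a file. A Scrapy pipeline is a plain Python class: `process_item` is the required hook, and `open_spider` / `close_spider` are optional hooks called when the spider starts and finishes. The class name `JsonLinesPipeline` and the default file name `items.jl` are assumptions chosen for this example.

```python
import json


class JsonLinesPipeline:
    """Write every scraped item to a file, one JSON object per line."""

    def __init__(self, path="items.jl"):
        self.path = path  # hypothetical output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item and return it so later pipelines can use it too.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item
```

A class like this only takes effect once it is listed in the `ITEM_PIPELINES` setting of the project.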

Windows environment setup

Scrapy's dependencies (each package can also be downloaded from its official site and installed separately):

1. wheel and lxml: pip install wheel lxml

2. zope.interface: pip install zope.interface-4.3.3-cp35-cp35m-win_amd64.whl

3. pyOpenSSL: pip install pyOpenSSL

4. Twisted: pip install Twisted

5. Scrapy: pip install Scrapy
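After running the commands above, a short script can report which of the packages are importable. This is only a sketch using the standard library; the helper names `is_importable` and `check_dependencies` are made up for this example. Note that the import name is not always the pip package name: pyOpenSSL is imported as `OpenSSL`.

```python
from importlib.util import find_spec

# Import names for the packages listed above (pyOpenSSL -> "OpenSSL").
DEPENDENCIES = ["lxml", "zope.interface", "OpenSSL", "twisted", "scrapy"]


def is_importable(module_name):
    """Return True if the module can be located without importing it fully."""
    try:
        return find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (for example "zope") is itself missing.
        return False


def check_dependencies(names=DEPENDENCIES):
    """Map each module name to whether it is importable."""
    return {name: is_importable(name) for name in names}


for name, ok in check_dependencies().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```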

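Anaconda + PyCharm + Scrapy

Anaconda is a Python distribution that bundles commonly used data science libraries. If it is not installed yet, download the package for your platform from http://www.continuum.io/downloads and install it. If it is already installed, Scrapy can be installed easily with the conda command:

conda install scrapy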

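Once Scrapy is installed, open a command-line terminal and enter scrapy. If the installation succeeded, Scrapy prints its version and a list of available commands.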

Creating a project

Command to create a crawler project:

scrapy startproject project_name

Command to create a spider file:

scrapy genspider example example.com

  D:\test>tree /F
  Folder PATH listing for volume 软件
  Volume serial number is 58B6-0E53
  D:.
  └─project_dir
      │  scrapy.cfg
      │
      └─project_name
          │  items.py
          │  middlewares.py
          │  pipelines.py
          │  settings.py
          │  __init__.py
          │
          ├─spiders
          │  │  __init__.py
          │  │
          │  └─__pycache__
          └─__pycache__

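items.py: defines the crawler's data models, similar to entity classes.

middlewares.py: defines the project's middlewares (spider and downloader middlewares), which hook into the processing of requests and responses.

pipelines.py: the pipeline file, responsible for processing the data returned by the spiders.

spiders directory: holds the spider classes, which inherit from scrapy.Spider.

scrapy.cfg: Scrapy's basic configuration.

__init__.py: package initialization file.

settings.py: the configuration for the whole crawler. Its contents are as follows: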

  # -*- coding: utf-8 -*-

  # Scrapy settings for baidu project
  #
  # For simplicity, this file contains only settings considered important or
  # commonly used. You can find more settings consulting the documentation:
  #
  #     https://doc.scrapy.org/en/latest/topics/settings.html
  #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
  #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

  BOT_NAME = 'baidu'

  # Modules where Scrapy looks for spiders
  SPIDER_MODULES = ['baidu.spiders']
  NEWSPIDER_MODULE = 'baidu.spiders'

  # Crawl responsibly by identifying yourself (and your website) on the user-agent
  #USER_AGENT = 'baidu (+http://www.yourdomain.com)'

  # Obey robots.txt rules
  # Whether to obey the robots.txt protocol (disabled in this project)
  ROBOTSTXT_OBEY = False

  # Configure maximum concurrent requests performed by Scrapy (default: 16)
  # Maximum request concurrency (default: 16)
  # CONCURRENT_REQUESTS = 32

  # Configure the request delay
  # Configure a delay for requests for the same website (default: 0)
  # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
  # See also autothrottle settings and docs
  #DOWNLOAD_DELAY = 3
  # The download delay setting will honor only one of:
  #CONCURRENT_REQUESTS_PER_DOMAIN = 16
  #CONCURRENT_REQUESTS_PER_IP = 16

  # Disable cookies (enabled by default)
  # Whether to use cookies
  #COOKIES_ENABLED = False

  # Disable Telnet Console (enabled by default)
  #TELNETCONSOLE_ENABLED = False

  # Override the default request headers:
  #DEFAULT_REQUEST_HEADERS = {
  #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  #   'Accept-Language': 'en',
  #}

  # Enable or disable spider middlewares
  # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
  #SPIDER_MIDDLEWARES = {
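  #         Lower values are closer to the engine: process_spider_input runs in
  #         ascending order, process_spider_output in descending order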
  #    'baidu.middlewares.BaiduSpiderMiddleware': 543,
  #}

  # Enable or disable downloader middlewares
  # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
  #DOWNLOADER_MIDDLEWARES = {
  #         Lower values are closer to the engine: process_request runs in
  #         ascending order, process_response in descending order
  #    'baidu.middlewares.BaiduDownloaderMiddleware': 543,
  #}

  # Enable or disable extensions
  # See https://doc.scrapy.org/en/latest/topics/extensions.html
  #EXTENSIONS = {
  #    'scrapy.extensions.telnet.TelnetConsole': None,
  #}

  # Configure item pipelines
  # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
  ITEM_PIPELINES = {
      # Lower values have higher priority and run first (conventionally 0-1000)
     'baidu.pipelines.BaiduPipeline': 1,
  }

  # Enable and configure the AutoThrottle extension (disabled by default)
  # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
  #AUTOTHROTTLE_ENABLED = True
  # The initial download delay
  #AUTOTHROTTLE_START_DELAY = 5
  # The maximum download delay to be set in case of high latencies
  #AUTOTHROTTLE_MAX_DELAY = 60
  # The average number of requests Scrapy should be sending in parallel to
  # each remote server
  #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
  # Enable showing throttling stats for every response received:
  #AUTOTHROTTLE_DEBUG = False

  # Enable and configure HTTP caching (disabled by default)
  # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
  #HTTPCACHE_ENABLED = True
  #HTTPCACHE_EXPIRATION_SECS = 0
  #HTTPCACHE_DIR = 'httpcache'
  #HTTPCACHE_IGNORE_HTTP_CODES = []
  #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
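The priority values in ITEM_PIPELINES above determine the order in which items pass through the pipelines: lower values run first. The following sketch shows that ordering with a plain sort. It is an illustration only, not Scrapy's internal code, and `baidu.pipelines.CleanupPipeline` is a hypothetical second pipeline added so there is something to order.

```python
# Pipelines mapped to their priority values.
ITEM_PIPELINES = {
    "baidu.pipelines.BaiduPipeline": 300,
    "baidu.pipelines.CleanupPipeline": 100,  # hypothetical second pipeline
}

# Sort the pipeline paths by priority value, lowest (earliest) first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
# ['baidu.pipelines.CleanupPipeline', 'baidu.pipelines.BaiduPipeline']
```

Leaving gaps between the values (100, 300, and so on) makes it easy to slot another pipeline in between later without renumbering.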