The Four-Step Scrapy Method: Crawling the 51job Site


云飞扬°

4143 views · 2019-08-19 11:20:13


The four-step Scrapy method

I. Create the project

Open the Terminal window at the bottom of PyCharm and run:

scrapy startproject <project_name>

For example: scrapy startproject crawler51job

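II. Define the data to scrape

Edit the items file, items.py. (An Item object holds the scraped data; it acts as the container for whatever the spider collects.)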

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Crawler51JobItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    position = scrapy.Field()  # job title
    company = scrapy.Field()  # company name
    place = scrapy.Field()  # work location
    salary = scrapy.Field()  # salary

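III. Create and write the spider file

1-Create the spider file:

scrapy genspider -t basic <spider_name> <domain>

For the spider used in this post, that is: scrapy genspider -t basic spider51job 51job.com

2-Write the spider file

[Note] allowed_domains lists the domain(s) the spider is allowed to crawl. start_urls lists the page(s) the crawl starts from.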
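To make the role of allowed_domains more concrete, here is a minimal sketch of the kind of host check it implies: a URL counts as on-site when its host is a listed domain or a subdomain of one. The helper name is_allowed is hypothetical and this is a simplified illustration, not Scrapy's actual filtering code.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or one of its subdomains.

    Hypothetical helper for illustration only; Scrapy applies its own
    off-site filtering internally.
    """
    host = urlparse(url).hostname or ''
    return any(host == domain or host.endswith('.' + domain)
               for domain in allowed_domains)


allowed = ['51job.com']
print(is_allowed('https://search.51job.com/list/', allowed))  # subdomain of 51job.com
print(is_allowed('https://example.com/', allowed))            # unrelated domain
print(is_allowed('https://not51job.com/', allowed))           # similar name, different domain
```

This is why a bare domain such as 51job.com is enough to cover search.51job.com; the entries should be domain names, not full URLs.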

Write the extraction code in the parse function.

First import Crawler51JobItem from the items file:

from crawler51job.items import Crawler51JobItem

# -*- coding: utf-8 -*-
import scrapy
from crawler51job.items import Crawler51JobItem


class Spider51jobSpider(scrapy.Spider):
    name = 'spider51job'
    allowed_domains = ['51job.com']
    start_urls = [
        'https://search.51job.com/list/010000,000000,0000,32,9,99,Java%25E5%25BC%2580%25E5%258F%2591,2,1.html']

    def parse(self, response):
        # Each XPath below matches every job row on the page, so each
        # field ends up holding a list of strings, one entry per posting.
        item = Crawler51JobItem()
        item['position'] = response.xpath('//div[@class="el"]/p[@class="t1 "]/span/a/@title').extract()
        item['company'] = response.xpath('//div[@class="el"]/span[@class="t2"]/a/@title').extract()
        item['place'] = response.xpath('//div[@class="el"]/span[@class="t3"]/text()').extract()
        item['salary'] = response.xpath('//div[@class="el"]/span[@class="t4"]/text()').extract()
        # A single item carrying the four parallel lists is handed to the pipeline.
        yield item
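The keyword inside the start URL above looks opaque because it is percent-encoded twice. The sketch below decodes it with the standard library. The helper name search_keyword is hypothetical, and the assumption that the keyword is the seventh comma-separated field is read off this one URL, not taken from any 51job documentation.

```python
from urllib.parse import quote, unquote

start_url = ('https://search.51job.com/list/'
             '010000,000000,0000,32,9,99,Java%25E5%25BC%2580%25E5%258F%2591,2,1.html')


def search_keyword(url):
    """Extract and decode the search keyword from a list URL like the one above.

    Assumes the keyword is the 7th comma-separated field of the last
    path segment, as it is in this particular URL.
    """
    last_segment = url.rsplit('/', 1)[-1]
    encoded = last_segment.split(',')[6]
    # "%25" decodes to "%", so one pass yields ordinary percent-encoding
    # and the second pass yields the UTF-8 text.
    return unquote(unquote(encoded))


keyword = search_keyword(start_url)
print(keyword)  # Java开发

# Going the other way: encode a keyword twice to get the URL form.
print(quote(quote('Java开发')))  # Java%25E5%25BC%2580%25E5%258F%2591
```

Under the same assumption, searching for a different keyword would mean swapping in quote(quote(keyword)) at that position of the URL.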

Write the pipelines.py file (pipelines are mainly used to process the items the spider yields)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from pandas import DataFrame

class Crawler51JobPipeline(object):
    def process_item(self, item, spider):
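        # Put the extracted lists into a DataFrame; .T turns the four lists into columns
        jobInfo = DataFrame([item['position'], item['company'], item['place'], item['salary']]).T
        # Set the column names (job title, company name, work location, salary)
        jobInfo.columns = ['职位名', '公司名', '工作地点', '薪资']
        # Save the data to a local file
        jobInfo.to_csv('jobInfo.csv', encoding='gbk')    # set the encoding to avoid garbled characters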
        return item
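The pipeline above relies on the item holding four parallel lists, which the transpose turns into one row per posting. As a rough illustration of that reshaping without pandas, the sketch below pairs the lists with zip and writes CSV text with the standard library. The sample values and the helper names item_to_rows and rows_to_csv are made up for the example.

```python
import csv
import io

# Hypothetical sample values standing in for the lists the spider extracts.
item = {
    'position': ['Java Developer', 'Senior Java Engineer'],
    'company': ['Company A', 'Company B'],
    'place': ['Beijing', 'Beijing-Haidian'],
    'salary': ['10-15k/month', '15-25k/month'],
}


def item_to_rows(item):
    """Pair the i-th entry of each column list into one row per job posting."""
    return list(zip(item['position'], item['company'],
                    item['place'], item['salary']))


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['position', 'company', 'place', 'salary'])
    writer.writerows(rows)
    return buffer.getvalue()


rows = item_to_rows(item)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Note that zip stops at the shortest list, so if one posting lacks a field the columns can shift or rows can be dropped. Extracting the fields row by row is one way to guard against that, at the cost of yielding one item per posting instead of one per page.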

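Write the settings.py file (settings.py is the project's configuration file and holds the settings for the crawler project.)

For example, because a pipeline is used here, uncomment the ITEM_PIPELINES block in settings.py so that the pipeline is enabled.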
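For reference, the uncommented setting would look roughly like the fragment below. The dotted path assumes the project is named crawler51job and the pipeline class is the Crawler51JobPipeline shown earlier; adjust it if your names differ.

```python
# settings.py (fragment)
# Maps each pipeline class path to an order number; pipelines with lower
# numbers run first, and values are conventionally kept in the 0-1000 range.
ITEM_PIPELINES = {
    'crawler51job.pipelines.Crawler51JobPipeline': 300,
}
```

With the pipeline enabled, running scrapy crawl spider51job from the project directory should start the spider and write jobInfo.csv.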