Python Scrapy Cross-Platform Crawling in Practice: XPath Parsing and Structured Data Extraction

Z_suger7 · Published 2026-06-29 16:47:37

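Request, download, parse, store: these four stages form the most basic crawler pipeline. The request and download stages look much the same across languages and frameworks; the parsing layer is where efficiency gaps really open up. BeautifulSoup gets unwieldy with deep nesting and conditional filtering, and regular expressions are hard to read and expensive to maintain. XPath is a W3C-standard query language designed for tree structures, and combined with Scrapy's asynchronous engine it is a very strong choice for large-scale crawlers that span multiple sites and platforms.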

1. Scrapy Project Initialization

pip install scrapy
scrapy startproject multispider && cd multispider
scrapy genspider technews example.com

Declare the structured fields in **items.py**:

import scrapy

class NewsItem(scrapy.Item):
    title        = scrapy.Field()
    url          = scrapy.Field()
    author       = scrapy.Field()
    publish_date = scrapy.Field()
    content      = scrapy.Field()
    tags         = scrapy.Field()
    source       = scrapy.Field()

2. XPath Quick Reference

Scenario	Expression	Notes
Global search	`//div[@class="list"]`	Matches at any depth in the document
Relative lookup	`.//h2/a/@href`	Rooted at the current node; the most important form in practice
Fuzzy match	`contains(@class, "active")`	Needed when an element carries several classes
Position limit	`//li[position()<=3]`	Takes the first N items (here N = 3)
Axis traversal	`//h2/following-sibling::p`	Selects following sibling nodes
Conditional exclusion	`//p[not(contains(@class,"ad"))]`	Filters out ads natively in XPath

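Core principle: when looping over list items, every child XPath must start with **.** (that is, **.//**). Otherwise the query restarts from the document root and searches the whole page, so each item ends up with data from the wrong entry.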

3. The Core Spider: Two-Level Parsing from List Page to Detail Page

Edit **spiders/technews.py**:

import scrapy
from multispider.items import NewsItem

class TechNewsSpider(scrapy.Spider):
    name = 'technews'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/news']

    def parse(self, response):
        # List page: locate every article entry
        for article in response.xpath('//div[@class="article-list"]/article'):
            detail_url = article.xpath('.//h2/a/@href').get()
            if detail_url:
                yield response.follow(
                    detail_url,
                    callback=self.parse_detail,
                    meta={'list_title': article.xpath('.//h2/a/text()').get(default='').strip()}
                )

        # Pagination
        next_page = response.xpath('//a[contains(@class,"next")]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        item = NewsItem()
        item['title'] = (
            response.xpath('//h1[@class="article-title"]/text()').get(default='').strip()
            or response.meta.get('list_title', '')
        )
        item['url']          = response.url
        item['author']       = response.xpath('//span[@class="author-name"]/text()').get(default='Anonymous').strip()
        item['publish_date'] = response.xpath('//time[@class="publish-date"]/@datetime').get()
        item['tags']         = response.xpath('//div[@class="tags"]//a/text()').getall()
        item['source']       = 'technews'

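        # Body text: exclude ad / recommendation nodes.
        # contains() is a substring test, so it also drops classes such as "lead".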
        paragraphs = response.xpath(
            '//div[@class="article-body"]'
            '//p[not(contains(@class,"ad")) and not(contains(@class,"recommend"))]'
            '/text()'
        ).getall()
        item['content'] = '\n'.join(p.strip() for p in paragraphs if p.strip())

        yield item

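Four key techniques:

**.//** relative paths: inside a loop body, start with **.** so a query cannot pick up data from other entries
**get(default='')**: a safe fallback that avoids **NoneType** errors when chaining **.strip()**
**response.follow()**: resolves relative URLs automatically, so there is no need to prepend the domain by hand
**meta** pass-through: carries list-page metadata to the detail-page callback as a fallback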
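
To make the `response.follow()` point concrete: relative links are resolved against the URL of the page they came from, following the same rules as the standard library's `urljoin`. The stdlib-only sketch below approximates that resolution step only, not Scrapy's request handling; the page URL and hrefs are made up.

```python
from urllib.parse import urljoin

page_url = "https://example.com/news/index.html"  # hypothetical list page

hrefs = ["/articles/42", "detail/7.html", "?page=2", "https://cdn.example.com/x"]
resolved = [urljoin(page_url, href) for href in hrefs]

for href, full in zip(hrefs, resolved):
    print(f"{href!r:32} -> {full}")
```

Root-relative, path-relative, and query-only links all come out as absolute URLs, and an already absolute URL is left unchanged.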

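4. Cross-Platform Adaptation: Decoupling Rule Configuration from Spider Logic

Different sites have different HTML structures, but the data model and cleaning logic are fully reusable. The core idea is to pull the XPath rules out into a configuration dictionary: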

import scrapy
from multispider.items import NewsItem

SITE_RULES = {
    'siteA': {
        'start_urls':    ['https://site-a.com/news'],
        'list_item':     '//div[@class="news-item"]',
        'detail_link':   './/a[@class="title"]/@href',
        'title':         '//h1[@class="post-title"]/text()',
        'author':        '//span[@itemprop="author"]/text()',
        'publish_date':  '//meta[@property="article:published_time"]/@content',
        'content':       '//div[@class="post-content"]//p/text()',
        'tags':          '//div[@class="tag-list"]//a/text()',
        'next_page':     '//a[@rel="next"]/@href',
    },
    'siteB': {
        # ... rules for another site
    },
}

class MultiSiteSpider(scrapy.Spider):
    name = 'multisite'

    def start_requests(self):
        for site_name, rules in SITE_RULES.items():
            # .get() lets an unfinished entry (such as siteB above) be skipped
            for url in rules.get('start_urls', []):
                yield scrapy.Request(url, callback=self.parse_list,
                                     meta={'site_name': site_name, 'rules': rules})

    def parse_list(self, response):
        rules = response.meta['rules']
        # Forward only our own keys; passing response.meta wholesale would also
        # copy Scrapy's internal bookkeeping (e.g. retry_times) to new requests.
        meta = {'site_name': response.meta['site_name'], 'rules': rules}
        for article in response.xpath(rules['list_item']):
            link = article.xpath(rules['detail_link']).get()
            if link:
                yield response.follow(link, callback=self.parse_detail, meta=meta)
        # Pagination
        next_page = response.xpath(rules['next_page']).get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_list, meta=meta)

    def parse_detail(self, response):
        rules = response.meta['rules']
        item = NewsItem()
        item['url']    = response.url
        item['source'] = response.meta['site_name']
        item['title']  = response.xpath(rules['title']).get(default='').strip()
        item['author'] = response.xpath(rules['author']).get(default='Anonymous').strip()
        item['publish_date'] = response.xpath(rules['publish_date']).get()
        item['content'] = '\n'.join(p.strip() for p in response.xpath(rules['content']).getall() if p.strip())
        item['tags']   = response.xpath(rules['tags']).getall()
        yield item

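Adding a new site only takes one more block of rule configuration, with no changes to the core code. This is the engineering payoff of keeping Scrapy spiders rule-driven across platforms.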
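
Because the spider looks rule keys up directly, a missing or misspelled key only surfaces as a `KeyError` in the middle of a crawl. One possible safeguard is to check every entry before starting. The sketch below is a hypothetical helper, not part of Scrapy: `REQUIRED_KEYS` and `find_incomplete_rules` are names chosen here, and the key set simply mirrors the `siteA` entry above.

```python
REQUIRED_KEYS = {
    "start_urls", "list_item", "detail_link", "title",
    "author", "publish_date", "content", "tags", "next_page",
}


def find_incomplete_rules(site_rules):
    """Return {site_name: sorted list of missing keys} for incomplete entries."""
    problems = {}
    for site_name, rules in site_rules.items():
        missing = REQUIRED_KEYS - set(rules)
        if missing:
            problems[site_name] = sorted(missing)
    return problems


example_rules = {
    "siteA": {key: "..." for key in REQUIRED_KEYS},  # placeholder values
    "siteB": {"start_urls": ["https://site-b.example/news"], "title": "//h1/text()"},
}
problems = find_incomplete_rules(example_rules)
print(problems)
```

Calling such a check once at start-up and refusing to run on a non-empty result keeps configuration mistakes out of the crawl itself.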

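5. Adding Proxy IPs: Getting Past Anti-Scraping Blocks

A large-scale, cross-platform crawler will sooner or later run into the target sites' per-IP rate limits. Taking the 16yun (亿牛云) crawler proxy as an example, adding a proxy to Scrapy only takes one downloader middleware.

Add the middleware to **middlewares.py**: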

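import base64
import random

def base64ify(bytes_or_str):
    """Base64-encode credentials for the Proxy-Authorization header."""
    input_bytes = bytes_or_str.encode('utf8') if isinstance(bytes_or_str, str) else bytes_or_str
    # HTTP Basic auth uses the standard Base64 alphabet, not the URL-safe one
    return base64.b64encode(input_bytes).decode('ascii')

class ProxyMiddleware:
    def process_request(self, request, spider):
        # 16yun crawler proxy parameters (official site: www.16yun.cn)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        proxyUser = "username"    # replace with your username
        proxyPass = "password"    # replace with your password

        # Set the proxy address
        request.meta['proxy'] = f"http://{proxyHost}:{proxyPort}"

        # Add the auth header. With Scrapy 2.6.2+ this can be left out if the
        # credentials are embedded in the proxy URL (http://user:pass@host:port).
        request.headers['Proxy-Authorization'] = 'Basic ' + base64ify(f"{proxyUser}:{proxyPass}")

        # Proxy-Tunnel: requests carrying the same number share an exit IP, so a
        # fresh random value per request rotates IPs; use a fixed value instead
        # when a login session must stay on one IP.
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)

        # To force an IP switch on every request, disable connection reuse
        request.headers['Connection'] = "Close"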
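
A note on the auth header: HTTP Basic authentication encodes `user:password` with the standard Base64 alphabet. The URL-safe variant produces a different token whenever the encoded output would contain `+` or `/`, which a standards-following proxy may fail to decode. The stdlib-only sketch below builds the header and shows the difference; `basic_proxy_auth` is a name chosen here, and the credentials are placeholders.

```python
import base64


def basic_proxy_auth(user, password):
    """Build a 'Basic ...' value for the Proxy-Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


header = basic_proxy_auth("username", "password")
scheme, token = header.split(" ", 1)
decoded = base64.b64decode(token).decode("utf-8")
print(header)
print(decoded)

# The two Base64 alphabets only differ for inputs that hit '+' or '/'
raw = b"u:????"  # placeholder credentials chosen to trigger the difference
standard = base64.b64encode(raw)
url_safe = base64.urlsafe_b64encode(raw)
print(standard, url_safe)
```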

Enable the middleware and configure the retry policy in **settings.py**:

DOWNLOADER_MIDDLEWARES = {
    'multispider.middlewares.ProxyMiddleware': 100,
}

# Retry automatically on proxy authentication failures (407)
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 407, 408, 429]

# Concurrency and rate limiting
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
DOWNLOAD_TIMEOUT = 15

Key points for using proxy IPs:

Scenario	Configuration	Notes
New IP for every request	`Connection: Close` + random Tunnel	Most common; suited to bulk crawling
Keep the same IP	Fixed `Proxy-Tunnel` value	Suited to flows that need a login session or cached cookies
HTTPS sites	Use the library's native proxy authentication	Prevents a manually set `Proxy-Authorization` header from being forwarded to the target site
407 errors	Check domain / port / username / password	The credentials are wrong
429 errors	Lower concurrency or increase the delay	Request rate exceeds the limit of the purchased plan
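
The "keep the same IP" row needs a tunnel value that stays fixed for the whole session. One way to get that without storing state is to derive the number from a session identifier. The sketch below is a hypothetical helper: `tunnel_id` and `session_key` are names chosen here, the 1..10000 range just mirrors the middleware above, and CRC32 is used because Python's built-in `hash()` for strings changes between processes.

```python
import random
import zlib


def tunnel_id(session_key=None, slots=10000):
    """Return a Proxy-Tunnel number in 1..slots.

    With a session_key the result is stable across calls and processes;
    without one a fresh random value is drawn.
    """
    if session_key is None:
        return random.randint(1, slots)
    return zlib.crc32(session_key.encode("utf-8")) % slots + 1


print(tunnel_id("account-42"), tunnel_id("account-42"))  # same value twice
print(tunnel_id(), tunnel_id())                          # independent random draws
```

In a middleware, the session key could come from `request.meta` (for example an account name set by the spider), with the random branch used for anonymous bulk requests.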

6. Data-Cleaning Pipelines

Edit **pipelines.py** to keep the cleaning logic separate from the spider logic:

import re
import json
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DataCleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Strip control characters and surrounding whitespace; \n and \t are
        # kept so the paragraph breaks in content survive
        for field in ['title', 'author', 'content']:
            val = adapter.get(field, '')
            if val:
                val = val.replace('\u00a0', ' ')
                val = re.sub(r'[\x00-\x08\x0b-\x1f\x7f-\x9f]', '', val).strip()
                adapter[field] = val if val else None

        # De-duplicate tags (case-insensitive, first occurrence wins)
        tags = adapter.get('tags') or []
        seen, cleaned = set(), []
        for tag in (t.strip() for t in tags if t.strip()):
            key = tag.lower()
            if key not in seen:
                seen.add(key)
                cleaned.append(tag)
        adapter['tags'] = cleaned[:10]

        # Required-field check
        if not adapter.get('title'):
            raise DropItem("Missing title")
        return item

class JsonExportPipeline:
    def open_spider(self, spider):
        self.file = open('output.jsonl', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        self.file.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item
    def close_spider(self, spider):
        self.file.close()

# settings.py
ITEM_PIPELINES = {
    'multispider.pipelines.DataCleaningPipeline': 100,
    'multispider.pipelines.JsonExportPipeline': 200,
}
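
The cleaning rules can also be tried outside Scrapy as plain functions, which makes them easy to unit-test. This is a minimal sketch, assuming newlines and tabs should survive (so paragraph breaks in `content` are kept) and non-breaking spaces become ordinary spaces. `clean_text` and `dedupe_tags` are names chosen here for illustration.

```python
import re

# Control characters except tab (\x09) and newline (\x0a)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")


def clean_text(value):
    """Drop control characters (keeping newlines and tabs) and trim whitespace."""
    if not value:
        return None
    value = value.replace("\u00a0", " ")
    value = CONTROL_CHARS.sub("", value).strip()
    return value or None


def dedupe_tags(tags, limit=10):
    """Case-insensitive de-duplication that keeps first-seen order."""
    seen, cleaned = set(), []
    for tag in (t.strip() for t in (tags or []) if t and t.strip()):
        key = tag.lower()
        if key not in seen:
            seen.add(key)
            cleaned.append(tag)
    return cleaned[:limit]


print(repr(clean_text("  Hello\x00\u00a0World\x07 \n")))
print(repr(clean_text("line 1\nline 2")))
print(dedupe_tags([" Python", "python ", "Scrapy", "", "SCRAPY"]))
```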

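7. Debugging and Running

# Verify XPath in Scrapy Shell (always do this before writing code)
scrapy shell 'https://example.com/news'
>>> response.xpath('//h1[@class="article-title"]/text()').get()
'Python 3.12 新特性解析'

# Run the spider (-o appends to an existing file; use -O to overwrite it)
scrapy crawl technews -o results.json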

8. XPath Pitfalls

Pitfall	Wrong	Right
Global search grabs the wrong data	`article.xpath('//h2/text()')`	`article.xpath('.//h2/text()')`
Multi-class mismatch	`@class="item active"`	`contains(@class, "active")`
Unhandled whitespace	using `.get()` directly	`.get(default='').strip()`
Garbled encoding	default encoding	`FEED_EXPORT_ENCODING='utf-8'`

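9. Summary

The engineering value of Scrapy + XPath shows up at three levels:

Parsing layer: XPath's tree queries handle deep nesting, multi-condition filtering, and axis traversal natively, where BeautifulSoup needs noticeably more code
Architecture layer: the asynchronous engine, middlewares, and pipelines support large-scale, cross-platform expansion by design. With rule configuration decoupled from spider logic, the marginal cost of adding a site is close to zero
Anti-scraping layer: a proxy middleware (such as the 16yun crawler proxy) plugs an IP pool in with little effort, the **Proxy-Tunnel** mechanism controls exactly when the IP switches, and the 407 retry policy adds stability

In real projects: verify XPath expressions in **scrapy shell** before writing code; keep all cleaning logic in pipelines; and choose rotating or fixed IP mode in the proxy middleware according to the business scenario. Get these three right and the crawler becomes markedly easier to maintain and more stable.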
