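🎯 Scenario: paging through a list on a school website and batch-downloading its images
🛠 Tools: requests + BeautifulSoup4, with random delays to avoid triggering anti-scraping measures
📦 Output: images saved automatically to a chosen folder, named index-title.jpg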
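
Titles scraped from a page can contain characters that are not allowed in file names, so the index-title.jpg naming scheme may need a cleaning step. Here is a minimal sketch, assuming a hypothetical helper `image_path` (not part of the original script) and a conservative set of characters to replace:

```python
import re
from pathlib import Path

# Characters that are invalid in Windows file names, plus line breaks and tabs.
_INVALID = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def image_path(folder, index, title, ext=".jpg"):
    """Build folder/index-title.ext with unsafe characters replaced by "_"."""
    clean = _INVALID.sub("_", title).strip(" ._") or "untitled"
    return Path(folder) / f"{index}-{clean}{ext}"


print(image_path("images", 3, "Sports Day: 2023/24").name)
```

Each run of unsafe characters collapses to a single underscore, and a title that ends up empty falls back to "untitled", so the file name is never blank.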

1. Project Background

I came across an assignment from my student days and, on a whim, rewrote it from scratch.

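The key is still locating the right tags: after parsing the page into a BeautifulSoup object, identify each tag's distinguishing features and drill down with successive find calls. Fetching the page, parsing it, and downloading the images are split into three functions, and the script keeps downloading as long as there is a next page.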
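
The drill-down idea can be tried on a small inline document. Here is a minimal sketch, assuming made-up HTML that reuses two class names from this post's script ('right n_tupian' and 'p_next p_fun'); the helper `find_chain` is hypothetical and simply returns None as soon as one step finds nothing:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class names come from the script.
HTML = """
<div class="right n_tupian">
  <ul>
    <li><div class="img"><a title="Campus gate"><img src="/images/gate.jpg"></a></div></li>
  </ul>
  <span class="p_next p_fun"><a href="list_2.htm">Next</a></span>
</div>
"""


def find_chain(node, steps):
    """Apply find() step by step; return None if any step has no match."""
    for name, css_class in steps:
        if node is None:
            return None
        node = node.find(name, class_=css_class) if css_class else node.find(name)
    return node


soup = BeautifulSoup(HTML, "html.parser")
link = find_chain(soup, [("div", "right n_tupian"), ("span", "p_next p_fun"), ("a", None)])
next_href = link["href"] if link is not None else None
missing = find_chain(soup, [("div", "no_such_class"), ("a", None)])
print(next_href, missing)
```

Checking for None at each step avoids the AttributeError that a chained .find().find() call raises when a tag is missing.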

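While looking for the next-page link, I noticed that the next page's URL differs from the current one only in its last path segment, which saves a big step. All that remains is to loop, checking each time whether a next page exists.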
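
One way to express this "swap the last segment" rule is the standard library's urllib.parse.urljoin, which resolves a relative href against the current page's URL. The sketch below is an alternative to the string replacement this post's script uses. The helper name `next_page_url` and the example.edu addresses are made up for illustration:

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve the next-page href against the current URL; None means no next page."""
    if not next_href:
        return None
    return urljoin(current_url, next_href)


current = "https://example.edu/news/photos/index.htm"
print(next_page_url(current, "index_2.htm"))
print(next_page_url(current, "/img/a.jpg"))
print(next_page_url(current, None))
```

urljoin replaces only the final path segment for a relative href and also handles site-absolute hrefs that start with "/", so the same call could build image URLs from src attributes.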

2. Overall Code Structure

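| Step | Function | Purpose |
| --- | --- | --- |
| 1️⃣ | getCpageNpage(url) | Request the current page, parse the HTML, and extract the next-page link |
| 2️⃣ | getImageUrl(soup) | Parse the URL and title of every image on the current page |
| 3️⃣ | downloadImage(page_url_dict, folder) | Iterate over the dict and download each image to the target folder |
| 🔄 | while url: loop | Keep paging until there is no next page |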
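
The control flow of these steps can be sketched without any network access. In the sketch below, `crawl`, `fetch_page`, `handle_page` and `FAKE_SITE` are hypothetical stand-ins for the real functions and pages. The visited-list guard and the `max_pages` cap are additions that are not in the original script:

```python
def crawl(start_url, fetch_page, handle_page, max_pages=1000):
    """Follow next-page links until there are none; return the URLs visited."""
    visited = []
    url = start_url
    # Stop on a missing next link, a repeated URL, or the page cap.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        payload, url = fetch_page(url)  # returns (page content, next URL or None)
        handle_page(payload)
    return visited


# A fake three-page site: each entry is (image list, next page).
FAKE_SITE = {
    "page1": (["a.jpg", "b.jpg"], "page2"),
    "page2": (["c.jpg"], "page3"),
    "page3": (["d.jpg"], None),
}

collected = []
pages = crawl("page1", lambda u: FAKE_SITE[u], collected.extend)
print(pages, collected)
```

Passing the fetch and handle steps in as arguments keeps the loop testable, and the visited list stops the crawl if a page's next link ever points back to an earlier page.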

3. Full Code

import requests
from bs4 import BeautifulSoup
import os
from time import sleep
import random



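def getCpageNpage(url):
    """Fetch one list page; return (soup, next_page_url), with None on the last page."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 Edg/148.0.0.0'
    }
    # The timeout keeps a stalled connection from hanging the script
    page = requests.get(url=url, headers=headers, timeout=10)
    # Set the encoding explicitly; otherwise the text comes out garbled
    page.encoding = "utf-8"
    soup = BeautifulSoup(page.text, 'html.parser')
    try:
        next_page_href = (soup.find('div', class_='right n_tupian')
                          .find('div', class_='pb_sys_common pb_sys_normal pb_sys_style1')
                          .find('span', class_='p_next p_fun')
                          .find('a')['href'])
        # The next page's URL is the current URL with everything after the last "/" replaced by the href.
        # rsplit swaps only that final segment, even if the same text occurs earlier in the URL.
        next_page_url = url.rsplit("/", 1)[0] + "/" + next_page_href
    except (AttributeError, TypeError, KeyError):
        # A tag in the chain is missing, so there is no next page
        next_page_url = None
    return soup, next_page_url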


def getImageUrl(soup):
    """Return a dict mapping each image URL on the page to its title."""
    div = soup.find('div', class_='right n_tupian')
    div_ul_li = div.find('ul').find_all('li')
    page_url_dict = {}
    for li in div_ul_li:
        img_box = li.find(class_='img')
        title = img_box.find("a")["title"]
        src = img_box.find("img")["src"]
        # src is prefixed with the site's base URL (placeholder) to form the full image URL
        page_url = "school website URL" + src
        # print(title, page_url)
        page_url_dict[page_url] = title
    return page_url_dict


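def downloadImage(page_url_dict, folder="./images"):
    """Download every image in page_url_dict into folder as <index>-<title>.jpg."""
    global index  # running image counter, initialised in the __main__ block
    os.makedirs(folder, exist_ok=True)
    for img in page_url_dict:
        title = page_url_dict[img]
        print(f"Downloading image {index}: {title} ({img})")
        response = requests.get(img, timeout=10)
        if response.status_code != 200:
            # Skip failed requests so an error page is not saved as a .jpg
            print(f"Skipped, HTTP status {response.status_code}")
            continue
        image_name = os.path.join(folder, f"{index}-{title}.jpg")
        # Images must be written as raw bytes, so open the file in binary mode
        with open(image_name, 'wb') as f:
            f.write(response.content)
        index += 1
        # Pause 0.5 to 1 second between downloads
        sleep(round(random.uniform(0.5, 1), 2))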


if __name__ == '__main__':
    url = 'school website URL'
    folder = 'save path'
    index = 1

    while url:
        print(url)
        try:
            soup, url = getCpageNpage(url)
        except Exception as e:
            # Stop paging: url would otherwise stay unchanged and the loop would retry forever
            print("Page request failed")
            print(e)
            break
        try:
            page_url_dict = getImageUrl(soup)
        except Exception as e:
            page_url_dict = {}
            print("Failed to extract image URLs")
            print(e)

        try:
            downloadImage(page_url_dict, folder)
        except Exception as e:
            print("Download failed")
            print(e)
        # Pause 2 to 4 seconds before requesting the next page
        sleep(round(random.uniform(2, 4), 1))
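
The two sleep calls in the script add a random pause between requests so the traffic looks less mechanical. Here is a minimal sketch of the same idea as a reusable helper, assuming a hypothetical function `polite_sleep` whose injectable `sleep` parameter exists only so the pause can be skipped during testing:

```python
import random
import time


def polite_sleep(low, high, sleep=time.sleep):
    """Pause for a random duration between low and high seconds; return the delay."""
    delay = round(random.uniform(low, high), 2)
    sleep(delay)
    return delay


# Record the delays instead of actually waiting.
recorded = []
delay = polite_sleep(0.5, 1, sleep=recorded.append)
print(delay, recorded)
```

In the script, `polite_sleep(0.5, 1)` would stand in for the per-image pause and `polite_sleep(2, 4)` for the per-page pause.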
