Python Web Scraping in Practice: Batch-Downloading Campus Scenery Photos
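🎯 Scenario: paginated list pages on a school's official website + batch image download
🛠 Tools: requests + BeautifulSoup4 + random delays to avoid anti-scraping blocks
📦 Output: images saved automatically to a chosen folder, named index-title.jpg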
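1. Project Background
I dug up an assignment from my student days and, on a whim, rewrote it from scratch.
The key is still locating the right tags: parse the page into a BeautifulSoup object, identify each tag's distinguishing features, and drill down with successive find calls. I split fetching the page, parsing it, and downloading the images into three functions, and the script keeps downloading as long as there is a next page.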
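To make the "drill down with find" idea concrete, here is a minimal sketch run against a hand-written HTML snippet. The markup is hypothetical: only the class names (right n_tupian, img, p_next p_fun) are taken from the selectors used in this post, and the real page is certainly more deeply nested. Note that find returns None when nothing matches, so the last step is guarded.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the article's selectors.
SAMPLE_HTML = """
<div class="right n_tupian">
  <ul>
    <li><div class="img"><a title="Library" href="#"><img src="/images/lib.jpg"></a></div></li>
    <li><div class="img"><a title="Lake" href="#"><img src="/images/lake.jpg"></a></div></li>
  </ul>
  <span class="p_next p_fun"><a href="list/2.htm">Next</a></span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: narrow the search to the container that holds the gallery.
container = soup.find("div", class_="right n_tupian")

# Step 2: walk each <li> and read the title and src attributes.
items = []
for li in container.find("ul").find_all("li"):
    img_div = li.find(class_="img")
    items.append((img_div.find("a")["title"], img_div.find("img")["src"]))

# Step 3: the next-page link may be absent on the last page, so guard against None.
next_span = container.find("span", class_="p_next p_fun")
next_href = next_span.find("a")["href"] if next_span else None

print(items)
print(next_href)
```

Passing a multi-class string such as "right n_tupian" to class_ matches the class attribute as an exact string, so it only works when the page lists the classes in that same order.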
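When looking for the next page, I noticed that the next-page URL differs from the current one only in its last path segment, which saved a big step. All that remains is to keep iterating and checking whether a next page exists.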
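The code in section 3 builds the next-page URL by rewriting the last path segment by hand. As an alternative sketch (not what this post uses), the standard library's urljoin resolves a relative href the way a browser would, which also covers URLs that end in a slash. The example.edu addresses and the helper name resolve_next_page are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_next_page(current_url, next_href):
    """Resolve a relative 'next page' href against the current page URL."""
    return urljoin(current_url, next_href)


# Hypothetical URLs: the real site layout is not shown in the article.
first = "https://example.edu/xyfg/index.htm"
second = resolve_next_page(first, "index/2.htm")
third = resolve_next_page(second, "3.htm")

# A URL ending in "/" has an empty last segment; urljoin still handles it.
from_dir = resolve_next_page("https://example.edu/xyfg/", "2.htm")

print(second)
print(third)
print(from_dir)
```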
2. Overall Code Structure
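| Step | Function | Purpose |
|---|---|---|
| 1️⃣ | getCpageNpage(url) | Request the current page, parse the HTML, and extract the next-page link |
| 2️⃣ | getImageUrl(soup) | Parse the URL and title of every image on the current page |
| 3️⃣ | downloadImage(page_url_dict, folder) | Iterate over the dict and download each image into the target folder |
| 🔄 | while url: loop | Keep turning pages until there is no next page |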
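The control flow in the table can be sketched without any network access. In the snippet below, FAKE_SITE, fetch_page, and crawl are hypothetical stand-ins (an in-memory dict replaces the real requests), so it only illustrates how the loop follows next-page links and keeps one running index across pages.

```python
# Hypothetical in-memory "site": page URL -> (image titles, next page URL or None).
FAKE_SITE = {
    "page1": (["Library", "Lake"], "page2"),
    "page2": (["Gate"], None),
}


def fetch_page(url):
    """Stand-in for getCpageNpage: return (page content, next-page URL)."""
    titles, next_url = FAKE_SITE[url]
    return titles, next_url


def crawl(start_url):
    """Follow next-page links until there are none, numbering items globally."""
    saved = []
    index = 1
    url = start_url
    while url:
        try:
            titles, next_url = fetch_page(url)
        except KeyError:
            # Stop on failure instead of retrying the same URL forever.
            break
        for title in titles:
            saved.append(f"{index}-{title}.jpg")
            index += 1
        url = next_url  # None on the last page ends the loop
    return saved


names = crawl("page1")
print(names)
```

Returning None for "no next page" lets the while condition double as the stop signal, which is the same convention the three functions in the table rely on.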
3. Full Code
import requests
from bs4 import BeautifulSoup
import os
from time import sleep
import random

def getCpageNpage(url):
    """Fetch the current page; return its parsed soup and the next-page URL (or None)."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36 Edg/148.0.0.0'
    }
    page = requests.get(url=url, headers=headers, timeout=10)
    # Set the encoding explicitly, otherwise the text comes out garbled
    page.encoding = "utf-8"
    soup = BeautifulSoup(page.text, 'html.parser')
    try:
        next_page_href = (soup.find('div', class_='right n_tupian')
                          .find('div', class_='pb_sys_common pb_sys_normal pb_sys_style1')
                          .find('span', 'p_next p_fun')
                          .find('a')['href'])
        # The next-page URL is the current URL with everything after
        # the last "/" replaced by the href found above
        next_page_url = url.rsplit("/", 1)[0] + "/" + next_page_href
    except (AttributeError, TypeError, KeyError):
        # A find() in the chain returned None or the <a> has no href: no next page
        next_page_url = None
    return soup, next_page_url

def getImageUrl(soup):
    """Map each image URL on the page to its title."""
    div = soup.find('div', class_='right n_tupian')
    div_ul_li = div.find('ul').find_all('li')
    page_url_dict = {}
    for li in div_ul_li:
        img_div = li.find(class_='img')
        title = img_div.find("a")["title"]
        src = img_div.find("img")["src"]
        # src is site-relative, so prepend the site root (placeholder string)
        page_url = "school site URL" + src
        page_url_dict[page_url] = title
    return page_url_dict

def downloadImage(page_url_dict, folder="./images"):
    """Download every image in the dict into folder as <index>-<title>.jpg."""
    global index  # running counter shared across pages, initialised in __main__
    os.makedirs(folder, exist_ok=True)
    for img in page_url_dict:
        title = page_url_dict[img]
        response = requests.get(img, timeout=10)
        print(f"Downloading image {index} ({title}) from {img}")
        image_name = os.path.join(folder, f"{index}-{title}.jpg")
        # Images must be written as a binary byte stream
        with open(image_name, 'wb') as f:
            f.write(response.content)
        index += 1
        # Short random pause between downloads
        sleep(round(random.uniform(0.5, 1), 2))
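
if __name__ == '__main__':
    url = 'school site URL'  # placeholder: URL of the first list page
    folder = 'save path'     # placeholder: target folder
    index = 1
    while url:
        print(url)
        try:
            soup, url = getCpageNpage(url)
        except Exception as e:
            # url was not updated, so looping again would retry this page forever
            print("Page request failed")
            print(e)
            break
        try:
            page_url_dict = getImageUrl(soup)
        except Exception as e:
            page_url_dict = {}
            print("Failed to extract image URLs")
            print(e)
        try:
            downloadImage(page_url_dict, folder)
        except Exception as e:
            print("Download failed")
            print(e)
        # Longer random pause between pages
        sleep(round(random.uniform(2, 4), 1))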
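One thing the listing above does not handle: downloadImage uses the page title directly in the file name, so a title containing characters such as / or ? would make open() fail on most systems. A possible hardening step, offered as a sketch only, is shown below. The helper name build_image_path and the choice of "_" as the replacement character are assumptions, not part of the original script.

```python
import re
from pathlib import Path

# Characters that are invalid in Windows file names, plus common whitespace controls.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n\t]')


def build_image_path(folder, index, title, ext=".jpg"):
    """Build '<folder>/<index>-<title><ext>' with unsafe characters replaced."""
    safe_title = _ILLEGAL.sub("_", title).strip() or "untitled"
    return Path(folder) / f"{index}-{safe_title}{ext}"


path = build_image_path("images", 3, "Lake/Spring: 2024?")
blank = build_image_path("images", 5, "   ")
print(path.name)
print(blank.name)
```

The returned Path could be passed to open() in place of the hand-built image_name string.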