Secretly Learning Web Scraping Under the Covers (5) --- XPath Practice
一. Scraping images and their titles from 彼岸图网
Target site: 彼岸图网 (https://pic.netbian.com/index.html)
Open the browser's inspector and take a quick look at the page structure.
With the target confirmed, let's start by scraping just the first page.
1. Fetch the whole page and handle the garbled output
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-26
import requests
from lxml import etree


def page_content(url):
    content = requests.get(url=url, headers=headers).text
    print(content)


if __name__ == '__main__':
    url = 'https://pic.netbian.com/index.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
    }
    page_content(url)
```
Running this prints mojibake.
Try the usual fix for a garbled response:
```python
def page_content(url):
    content = requests.get(url=url, headers=headers)
    content.encoding = 'utf-8'
    content = content.text
    print(content)
```
Still garbled, so utf-8 is not the right codec here.
Let's switch to gbk or gb2312 and try again:
```python
def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    print(content)
```
Mojibake solved. Scraping is like a chess match: meet each move with a counter-move!
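Why gb2312 works: the site serves its HTML in a GBK-family encoding, so decoding the raw bytes as utf-8 mangles every Chinese character. A tiny offline sketch of the effect (in practice you can also consult `response.apparent_encoding`, which requests detects from the bytes):

```python
# The server sends bytes encoded with a GBK-family codec.
raw = '彼岸图网'.encode('gbk')

# Decoding with the matching codec recovers the text.
print(raw.decode('gbk'))

# Decoding the same bytes as utf-8 simply fails.
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('utf-8 cannot decode these bytes')
```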
2. Parsing the content with XPath
① Grab all the li elements
```python
def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    # print(content)
    tree = etree.HTML(content)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    print(li_list)
```
② Extract the titles
```python
def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    # print(content)
    tree = etree.HTML(content)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    # print(li_list)
    for li in li_list:
        title = li.xpath('.//b/text()')[0]
        print(title)
```
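To see what those two XPath expressions do without hitting the network, here is a self-contained sketch run against a small fragment modeled on the site's markup (the fragment, titles, and paths below are made up for illustration):

```python
from lxml import etree

# Hypothetical HTML shaped like the target page: #main's third div holds the ul.
html = '''
<div id="main"><div></div><div></div><div>
  <ul>
    <li><a href="/d1.html"><img src="/uploads/p1.jpg"><b>Sunset</b></a></li>
    <li><a href="/d2.html"><img src="/uploads/p2.jpg"><b>Mountain</b></a></li>
  </ul>
</div></div>
'''
tree = etree.HTML(html)
li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
for li in li_list:
    # .// searches relative to the current li element.
    print(li.xpath('.//b/text()')[0], li.xpath('.//img/@src')[0])
```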
③ Extract each image link and build the full URL
```python
def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    # print(content)
    tree = etree.HTML(content)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    # print(li_list)
    for li in li_list:
        title = li.xpath('.//b/text()')[0]
        # print(title)
        picture = li.xpath('.//img/@src')[0]
        print(picture)
```
The extracted paths are incomplete, so we prepend the site root:
```python
def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    # print(content)
    tree = etree.HTML(content)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    # print(li_list)
    for li in li_list:
        title = li.xpath('.//b/text()')[0]
        # print(title)
        picture = li.xpath('.//img/@src')[0]
        picture = 'https://pic.netbian.com/' + picture
        print(picture)
```
Notice that the complete image URLs are now highlighted, and clicking one in PyCharm opens the image in the browser.
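String concatenation works here, but the standard library's `urllib.parse.urljoin` handles leading slashes and relative paths more safely (the `src` value below is a hypothetical example, not taken from the site):

```python
from urllib.parse import urljoin

base = 'https://pic.netbian.com/index.html'
# Hypothetical value as it might come out of the @src attribute:
src = '/uploads/allimg/sample.jpg'

# urljoin resolves src against base without producing double slashes.
print(urljoin(base, src))  # https://pic.netbian.com/uploads/allimg/sample.jpg
```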
④ Save the images locally, naming each file "title.jpg"
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-26
import requests
from lxml import etree
import os

if not os.path.exists('./4k_pictures'):
    os.mkdir('./4k_pictures')


def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    # print(content)
    tree = etree.HTML(content)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    # print(li_list)
    for li in li_list:
        title = li.xpath('.//b/text()')[0]
        # print(title)
        picture = li.xpath('.//img/@src')[0]
        picture = 'https://pic.netbian.com/' + picture
        # print(picture)
        picture_content = requests.get(url=picture, headers=headers).content
        picture_path = './4k_pictures/' + title + '.jpg'
        with open(picture_path, 'wb') as f:
            f.write(picture_content)
        print(title, '\033[31;1mdone!!!\033[0m')


if __name__ == '__main__':
    url = 'https://pic.netbian.com/index.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
    }
    page_content(url)
```
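One caveat: titles scraped from a page can contain characters that are illegal in file names on Windows, which would make `open()` fail. A small sketch of a sanitizing helper (the function name and pattern are my own, not from the original script):

```python
import re

def safe_filename(title):
    # Replace the characters Windows forbids in file names with underscores.
    return re.sub(r'[\\/:*?"<>|]', '_', title)

print(safe_filename('4K wallpaper: sea/sky'))  # 4K wallpaper_ sea_sky
```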
二. Scraping the first 500 pages with multiple threads 🚀
One page is done, but fetching 500 pages one by one would be painfully slow. Let's add a thread pool and benchmark it:
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-26
import requests
from lxml import etree
import os
from concurrent.futures import ThreadPoolExecutor
import time

if not os.path.exists('./4k_pictures'):
    os.mkdir('./4k_pictures')


def page_content(url):
    content = requests.get(url=url, headers=headers)
    # content.encoding = 'gbk'
    content.encoding = 'gb2312'
    content = content.text
    # print(content)
    tree = etree.HTML(content)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    # print(li_list)
    for li in li_list:
        title = li.xpath('.//b/text()')[0]
        # print(title)
        picture = li.xpath('.//img/@src')[0]
        picture = 'https://pic.netbian.com/' + picture
        # print(picture)
        picture_content = requests.get(url=picture, headers=headers).content
        picture_path = './4k_pictures/' + title + '.jpg'
        with open(picture_path, 'wb') as f:
            f.write(picture_content)
        print(title, '\033[31;1mdone!!!\033[0m')


if __name__ == '__main__':
    # url = 'https://pic.netbian.com/index.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
    }
    user_input = int(input('1286 pages in total; how many pages do you want to scrape?>>>:'))
    start_time = time.time()
    with ThreadPoolExecutor(100) as t:
        for page in range(1, user_input + 1):
            if page == 1:
                t.submit(page_content, 'https://pic.netbian.com/index%s.html' % "")
            else:
                t.submit(page_content, 'https://pic.netbian.com/index%s.html' % ('_' + str(page)))
    end_time = time.time()
    print('\nTotal time: %s seconds!' % (end_time - start_time))
    # page_content(url)
```
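The pagination rule hidden in the if/else above (page 1 is index.html, every later page is index_N.html) can be factored into a small helper; a minimal sketch:

```python
def page_url(page):
    # Page 1 has no suffix; page N > 1 becomes index_N.html.
    suffix = '' if page == 1 else '_%d' % page
    return 'https://pic.netbian.com/index%s.html' % suffix

print(page_url(1))  # https://pic.netbian.com/index.html
print(page_url(5))  # https://pic.netbian.com/index_5.html
```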
The first 500 pages yielded 8,966 images in 305 seconds, which averages out to nearly 30 images per second. What do you think of that throughput?