(Python web scraping) Parsing Ajax responses to scrape and download Toutiao (今日头条) image galleries


亦是此间少年 · published 2019-05-19 19:30:22 · 407 views

First, a quick gripe: CSDN's color scheme is painfully plain. Sigh.

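Things to watch out for in this task:

When calling requests.get, you have to send a cookie.

The page source has also changed: the detail page is no longer a JSON payload, it is an HTML document.

There are also the other spots I annotated in the source code.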
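Because the detail page is now an HTML document, the gallery data has to be dug out of an inline script instead of being read as JSON directly. Here is a minimal sketch of one way to do that. It assumes the page embeds the gallery as `gallery: JSON.parse("...")` followed by `siblingList`, which is the shape the regex in the source code targets. The sample HTML, the key names `sub_images` and `url`, and the helper name `extract_gallery_urls` are all made up for illustration, so check them against a real page before relying on this.

```python
import json
import re

# Made-up sample of a detail page; the real markup may differ.
SAMPLE_HTML = r'''
<script>
var BASE_DATA = {
    gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a\"},{\"url\":\"http:\\/\\/example.com\\/b\"}]}"),
    siblingList: [],
};
</script>
'''


def extract_gallery_urls(html):
    """Return the image URLs of the embedded gallery, or [] if there is none."""
    # Capture the quoted JavaScript string literal passed to JSON.parse.
    match = re.search(r'gallery: JSON\.parse\((".*?")\),\s*siblingList', html, re.S)
    if match is None:
        return []  # not a gallery page
    # First decode: the string literal becomes the JSON text it holds.
    json_text = json.loads(match.group(1))
    # Second decode: the JSON text becomes a dict.
    gallery = json.loads(json_text)
    # 'sub_images' and 'url' are assumed key names.
    return [image["url"] for image in gallery.get("sub_images", []) if "url" in image]


print(extract_gallery_urls(SAMPLE_HTML))
```

Decoding twice with json.loads undoes the escaping layer by layer, so there is no need to strip backslashes by hand. It only works when the JavaScript string literal also happens to be a valid JSON string, which is an assumption here.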

Result of a successful run: [screenshot]

Below is the source code (including some of the tricky points I noted down by hand):

import json
from urllib.parse import urlencode
import requests
import re
import os

cookie = """I can't paste mine here; log in yourself and look it up in Google Chrome"""
header = {'User-agent': "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
          'cookie': cookie}

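# Request the index page
def get_page_index(keyword, offset):
    data = {"aid": 24,
            "app_name": "web_search",
            "offset": offset,
            "format": "json",
            "keyword": keyword,
            "autoload": "true",
            "count": "20",
            "en_qc": 1,
            "cur_tab": 1,
            "from": "search_tab",
            "pd": "synthesis",
            "timestamp": 1557999858125}
    url = "https://www.toutiao.com/api/search/content/?" + urlencode(data)  # urlencode (from urllib) turns the parameters into a URL query string
    try:
        r = requests.get(url, headers=header, timeout=20)
        if r.status_code == 200:
            # print(r.encoding)  # encoding guessed from the HTTP headers
            # print(r.apparent_encoding)  # encoding guessed from the content
            # Neither guess above decoded the text correctly for me. I originally set "json" here by
            # trial and error (lol); that is not a codec name, so requests fell back to UTF-8 anyway.
            r.encoding = "utf-8"
            return r.text
    except requests.RequestException:
        print("Failed to fetch the index page!")
    return None  # request failed or the status code was not 200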

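# Parse the JSON data of the index page
def parse_page_index(html):
    if not html:  # get_page_index returns None when the request fails
        return
    data = json.loads(html)  # convert the JSON text into a dict
    if data and 'data' in data:  # the response is non-empty and has a 'data' key
        for item in data.get('data') or []:  # 'data' may be null
            yield item.get('article_url')  # yield makes this a generator; see https://blog.csdn.net/mieleizhi0522/article/details/82142856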

# Fetch a detail page
def get_page_detail(url):
    try:
        r = requests.get(url, headers=header, timeout=20)
        return r.text
    except requests.RequestException:
        print("Error fetching the detail page!")
        return None

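# Parse a detail page
def parse_page_detail(html):
    if not html:  # get_page_detail returns None on failure
        return None
    # The r prefix makes this a raw string, so backslashes are not treated as Python escapes
    gallery = re.findall(r"gallery: JSON\.parse\((.*?)siblingList", html, re.S)
    if gallery:
        # findall returns a list, normally with a single item; only the first match is used
        deal_gallery = re.sub(r"\\", '', gallery[0])  # strip the escaping backslashes; the result is still a string
        url_list = re.findall('(http.*?)"', deal_gallery, re.S)  # list of image URLs
        return url_list
    return None  # not a gallery page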

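# Download an image
def download(url):
    root = "/home/kaixin/桌面/今日头条图片/"
    path = os.path.join(root, url.split('/')[-1])  # name the file after the last URL segment
    try:
        os.makedirs(root, exist_ok=True)  # create the target folder if it does not exist yet
        if not os.path.exists(path):  # skip images that were already downloaded
            r = requests.get(url, headers=header, timeout=20)
            with open(path, 'wb') as f:
                f.write(r.content)
            print("Saved!")
    except (requests.RequestException, OSError):
        print("Save failed!")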

def main(key, pages):
    for i in range(pages):  # the original `while i <= pages` fetched one page too many
        offset = 20 * i  # each index page holds 20 results
        dict_html = get_page_index(key, offset)
        all_url_list = []
        for url in parse_page_index(dict_html):
            if not url:  # skip results that carry no article URL
                continue
            html = get_page_detail(url)
            if not html:  # the detail page could not be fetched
                continue
            url_lists = parse_page_detail(html)
            if url_lists:  # skip non-gallery pages; parse_page_detail returns None for them
                all_url_list.extend(url_lists)
        for url in all_url_list:
            download(url)

if __name__ == '__main__':
    main("街拍", 3)  # search keyword ("街拍" means street snaps) and number of pages
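The index requests page through the search results by stepping `offset` in multiples of 20. Below is a small sketch of just that URL-building step, with no network access. `build_index_urls` and `PAGE_SIZE` are hypothetical names, and the parameter set is trimmed for readability; a real request presumably needs the full set used in get_page_index.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.toutiao.com/api/search/content/?"
PAGE_SIZE = 20  # mirrors the "count" parameter in get_page_index


def build_index_urls(keyword, pages):
    """Return one index URL per page, with offsets 0, 20, 40, ..."""
    urls = []
    for page in range(pages):
        params = {
            "offset": page * PAGE_SIZE,
            "format": "json",
            "keyword": keyword,  # urlencode percent-encodes non-ASCII text as UTF-8
            "count": PAGE_SIZE,
        }
        urls.append(BASE_URL + urlencode(params))
    return urls


for url in build_index_urls("街拍", 3):
    print(url)
```

Using `range(pages)` yields exactly `pages` URLs, which makes the page count easy to check before any request goes out.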
