A Step-by-Step Guide to Giving a DeepSeek V4 Agent Live Web Access with the Decodo Scraper API

Author: 猫头虎 (WeChat: Libin9iOak) · Published 2026-06-03 19:17:23

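Building a Web-Connected DeepSeek V4 Agent from Scratch: A Complete Integration Tutorial with Decodo

DeepSeek V4 is one of the most capable open-weight models available today. Like every large language model, though, it has a training data cutoff: it knows nothing about anything that happened after its data was collected. For an agent that has to reason about current prices, breaking news, live product pages, or any other real-world data, that cutoff is a hard wall. This tutorial walks through the full process of giving a DeepSeek V4 agent live web access with the Decodo Web Scraping API.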


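What you will build

A Python agent that can:

Verify credentials and confirm that the Decodo connection works before running any task
Scrape any public URL through the Decodo Web Scraping API and get clean content back
Run live Google searches through the Decodo SERP API and get structured results
Pass that content straight into DeepSeek V4-Flash or V4-Pro for reasoning and output

Building the whole pipeline from scratch takes about 15 minutes.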

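Prerequisites

Before writing any code, you need two things:

A Decodo account and API token. Get access to the Decodo dashboard. After logging in, open the Web Scraping API section, activate a subscription (a free plan is available), and copy your API token from the Basic authentication token tab.

A DeepSeek API key. Create a DeepSeek account and generate an API key in the console. DeepSeek V4-Flash is the cost-efficient default, while V4-Pro is the more capable version.

Install the required dependencies in your terminal (the code below also imports python-dotenv to load the .env file):

    pip install requests python-dotenv

Store your credentials as environment variables, for example in a .env file next to your script, rather than hardcoding tokens in source files:

    DECODO_TOKEN="your_decodo_api_token"
    DEEPSEEK_API_KEY="your_deepseek_api_key"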

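Step 0: Verify your credentials first

Before building anything, confirm that your Decodo token is valid and that the API is reachable. This surfaces authentication problems at the start instead of halfway through the pipeline.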

    import os

    import requests
    from dotenv import load_dotenv, find_dotenv

    # 1. Search up the directory tree for a .env file and load it
    load_dotenv(find_dotenv())

    # 2. Read the token; os.getenv returns None instead of raising if it is unset
    DECODO_TOKEN = os.getenv("DECODO_TOKEN")

    if not DECODO_TOKEN:
        raise ValueError("DECODO_TOKEN is missing. Please check your .env file.")

    # 3. Pass the token exactly as copied (no Base64 encoding needed)
    DECODO_HEADERS = {
        "accept": "application/json",
        "content-type": "application/json",
        "authorization": f"Basic {DECODO_TOKEN}",
    }


    def verify_credentials() -> bool:
        """
        Hit the Decodo IP endpoint to confirm the token is valid
        and the API is reachable. Returns True on success.
        """
        try:
            response = requests.post(
                "https://scraper-api.decodo.com/v2/scrape",
                json={"url": "https://ip.decodo.com/ip"},
                headers=DECODO_HEADERS,
                timeout=30,
            )
            # Raises requests.HTTPError for 4xx/5xx responses (e.g. 401)
            response.raise_for_status()

            if response.status_code == 200:
                ip = response.json()["results"][0]["content"]
                print(f"Connection verified. Assigned IP: {ip.strip()}")
                return True

        except requests.exceptions.RequestException as e:
            print(f"Network or Auth error occurred: {e}")
        except (KeyError, IndexError):
            print("Received an unexpected JSON structure from the API.")

        return False


    if __name__ == "__main__":
        assert verify_credentials(), "Fix credentials before proceeding."

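The IP endpoint (https://ip.decodo.com/ip) is Decodo's own lightweight test target. It returns a single line showing the exit IP assigned to your request, which makes it the fastest way to confirm that your token works without scraping a real target site.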
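The endpoint returns a short sentence rather than a bare address (the sample response later in this tutorial shows `Your Ip is: 213.87.163.6`), so a small helper can pull out just the IP. This is a minimal sketch: `extract_ip` is a hypothetical helper name, and it assumes the address is IPv4 in dotted notation.

```python
import re
from typing import Optional

# Matches four dot-separated groups of 1-3 digits (IPv4 only; an assumption).
_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ip(content: str) -> Optional[str]:
    """Return the first IPv4-looking token in the text, or None if absent."""
    match = _IPV4.search(content)
    return match.group(0) if match else None


print(extract_ip("Your Ip is: 213.87.163.6"))  # 213.87.163.6
```

Returning `None` instead of raising lets the caller decide whether a missing address should count as a failed check.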

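Step 1: Scrape a live URL with the Decodo Web Scraping API

Web scraping is unpredictable by nature. Sooner or later you will hit rate limits or a target site that times out. To handle that, we first create a shared network helper that retries with exponential backoff, then use it to build the core URL scraper, which can also request JavaScript-rendered pages.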

    import time


    def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
        """Shared POST helper with exponential-backoff retry."""
        delay = backoff
        for attempt in range(retries):
            try:
                r = requests.post(
                    "https://scraper-api.decodo.com/v2/scrape",
                    json=payload,
                    headers=DECODO_HEADERS,
                    timeout=60,
                )
                r.raise_for_status()
                return r.json()
            except requests.HTTPError as exc:
                # Retry only on rate limiting (429) or target timeout (524),
                # and only while attempts remain; otherwise re-raise.
                if exc.response.status_code in (429, 524) and attempt < retries - 1:
                    print(f"HTTP {exc.response.status_code}, retry {attempt + 1} in {delay}s")
                    time.sleep(delay)
                    delay *= 2
                else:
                    raise


    def scrape_url(
        url: str,
        javascript: bool = False,
        geo: str | None = None,
    ) -> str:
        """
        Fetch a public URL via the Decodo Web Scraping API.

        Args:
            url:        Target URL to scrape.
            javascript: Set True for JS-rendered pages (Taobao, SPAs, etc.).
            geo:        Route through a specific country, e.g. "China", "United States".
        Returns:
            Raw HTML string of the target page.
        """
        payload: dict = {"url": url}
        if javascript:
            payload["headless"] = "html"
        if geo:
            payload["geo"] = geo
        return _post_decodo(payload)["results"][0]["content"]


    # Quick smoke test
    html = scrape_url("https://ip.decodo.com/ip")
    print(html.strip())  # Prints the assigned IP
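To make the retry timing of `_post_decodo` concrete, the sketch below computes the sleep durations it would use for a given configuration. `backoff_delays` is a hypothetical helper written for illustration; it mirrors the loop above, where the final attempt re-raises instead of sleeping.

```python
def backoff_delays(retries: int = 3, backoff: float = 2.0) -> list:
    """Return the sleeps between attempts: the first equals `backoff`,
    and each later one doubles. With `retries` attempts there are at
    most `retries - 1` sleeps, because the last failure is re-raised."""
    return [backoff * (2 ** i) for i in range(max(retries - 1, 0))]


print(backoff_delays())        # [2.0, 4.0]
print(backoff_delays(5, 1.0))  # [1.0, 2.0, 4.0, 8.0]
```

With the defaults, a request that keeps hitting 429 waits 2 s and then 4 s, so the worst case adds about 6 s of sleeping on top of the request timeouts.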

Response structure

A successful 200 response:

    {
      "results": [
        {
          "content": "Your Ip is: 213.87.163.6",
          "status_code": 200,
          "url": "https://ip.decodo.com/ip",
          "task_id": "6971034977135771649",
          "created_at": "2026-04-24 09:24:14",
          "updated_at": "2026-04-24 09:24:17"
        }
      ]
    }
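Given that structure, the content can be read without risking a `KeyError` or `IndexError` when the shape differs. The following is a sketch under assumptions: `first_content` is a hypothetical helper, and it treats the inner `status_code` as the status of the target fetch, which the sample suggests but this tutorial does not state outright.

```python
from typing import Any, Optional


def first_content(response: dict) -> Optional[str]:
    """Return results[0]['content'] if it is present and the inner
    status_code is 200; otherwise return None."""
    results: Any = response.get("results") or []
    if not results or not isinstance(results[0], dict):
        return None
    item = results[0]
    if item.get("status_code") != 200:
        return None
    content = item.get("content")
    return content if isinstance(content, str) else None


sample = {
    "results": [
        {
            "content": "Your Ip is: 213.87.163.6",
            "status_code": 200,
            "url": "https://ip.decodo.com/ip",
        }
    ]
}
print(first_content(sample))  # Your Ip is: 213.87.163.6
```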

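Key parameters

url: the only required parameter. It can be any publicly accessible URL.

headless: set to "html" to enable full JavaScript rendering. It is required for Taobao product pages, JS-heavy dashboards, and any page that loads prices or content dynamically after the first render. Omit it for static HTML pages, because it adds latency.

geo: routes the request through an IP in the specified country. Accepts country names: "China", "United States", "Germany", and so on.

device_type: optional. Accepted values: "desktop" (default), "mobile", "desktop_chrome", "mobile_android".

Do not add proxy_pool or any other undocumented parameter to the payload. They are not valid fields for this endpoint; at best they are silently ignored, and at worst they cause unexpected behavior. The Decodo API selects a suitable proxy pool automatically based on the target URL and your subscription.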
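One way to keep undocumented fields out of the payload is to validate it before sending. The sketch below is illustrative only: `build_scrape_payload` and its allow-lists are hypothetical, and they contain just the parameters listed in this section, not an authoritative schema of the API.

```python
# Allow-lists taken from the parameters described above (an assumption,
# not an official schema).
ALLOWED_OPTIONS = {"headless", "geo", "device_type"}
DEVICE_TYPES = {"desktop", "mobile", "desktop_chrome", "mobile_android"}


def build_scrape_payload(url: str, **options) -> dict:
    """Build a scrape payload, rejecting fields this tutorial does not list."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"Unsupported payload fields: {sorted(unknown)}")
    device = options.get("device_type")
    if device is not None and device not in DEVICE_TYPES:
        raise ValueError(f"Unsupported device_type: {device!r}")
    payload = {"url": url}
    # Skip options explicitly passed as None so they are not sent at all.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


print(build_scrape_payload("https://example.com", headless="html", geo="Germany"))
```

Failing early with a `ValueError` makes a mistyped field visible locally instead of leaving it to be silently ignored by the server.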

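Step 2: Run live Google searches with the Decodo SERP API

Google Search gives the AI access to current events, but the layout of a results page changes from query to query. We therefore parse the JSON response defensively, so the agent does not crash when Google returns a Knowledge Panel instead of standard organic result links.

Google Search with defensive parsing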

    import json


    def google_search(
        query: str,
        geo: str = "China",
        num_pages: int = 1,
    ) -> list[dict]:
        """
        Run a Google Search via the Decodo SERP API.
        Returns a list of organic result dicts (title, url, desc).
        """
        payload = {
            "target": "google_search",
            "query": query,
            "parse": True,
            "num_pages": num_pages,
            "locale": "zh-CN",
            "geo": geo,
        }

        response_data = _post_decodo(payload)

        try:
            results_array = response_data.get("results", [{}])
            if not results_array:
                return []

            content_dict = results_array[0].get("content", {})
            inner_results = content_dict.get("results", {})
            organic_results = inner_results.get("organic", [])

            if not organic_results:
                # Dump what came back so layouts without organic links
                # (e.g. a Knowledge Panel) can be inspected
                formatted_json = json.dumps(content_dict, indent=2, ensure_ascii=False)
                print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")

            return organic_results

        except Exception as e:
            # Covers unexpected shapes, e.g. content returned as a string
            print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
            return []

Parsed result structure

    {
      "pos": 1,
      "url": "https://example.com/deepseek-v4",
      "title": "DeepSeek V4 benchmark results",
      "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
      "url_shown": "example.com",
      "pos_overall": 1
    }

Always set parse: True when passing SERP results to an LLM. For the same information, structured JSON consumes only a fraction of the tokens that raw HTML does.
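To illustrate how compact the parsed results can be made before they reach the model, here is a sketch that keeps only the title, URL, and description of each organic result. `format_results` is a hypothetical helper; it uses `.get` so that an entry missing one of those keys does not raise.

```python
def format_results(organic: list, limit: int = 5) -> str:
    """Render the top `limit` organic results as a short numbered list."""
    lines = []
    for i, item in enumerate(organic[:limit], start=1):
        title = item.get("title") or "(no title)"
        url = item.get("url") or "(no url)"
        desc = item.get("desc") or ""
        # rstrip drops the trailing indent when the description is empty
        lines.append(f"{i}. {title}\n   {url}\n   {desc}".rstrip())
    return "\n".join(lines)


sample = [{
    "pos": 1,
    "url": "https://example.com/deepseek-v4",
    "title": "DeepSeek V4 benchmark results",
    "desc": "DeepSeek V4-Pro scores top marks on MMLU and coding evals...",
}]
print(format_results(sample))
```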

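Other supported SERP targets

baidu_search: Baidu keyword search (returns HTML; parsing is not supported)

google_shopping_search: Google Shopping results (parseable)

google_ads: Google results including paid ads (parseable)

google_trends_explore: Google Trends data for a keyword (returns structured JSON)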

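Step 3: Clean the HTML before passing it to DeepSeek V4

Large language models (LLMs) bill by the token. Feeding them raw HTML full of structural tags, inline styles, and tracking scripts wastes the context window and degrades reasoning quality. This cleaning step strips the noise and keeps only the meaningful plain text the model actually needs.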

    import re


    def html_to_text(html: str, max_chars: int = 12_000) -> str:
        """
        Strip HTML tags, remove script/style blocks, collapse whitespace.
        Trims output to max_chars to control token consumption.
        """
        # Drop <script> and <style> blocks together with their contents
        html = re.sub(r'<(script|style)[^>]*>.*?</(script|style)>',
                      '', html, flags=re.DOTALL | re.IGNORECASE)
        # Replace every remaining tag with a space
        text = re.sub(r'<[^>]+>', ' ', html)
        # Collapse runs of whitespace, then truncate
        return re.sub(r'\s+', ' ', text).strip()[:max_chars]


    # For complex pages, BeautifulSoup gives better results:
    # pip install beautifulsoup4
    # from bs4 import BeautifulSoup
    # def html_to_text(html, max_chars=12_000):
    #     text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    #     return text[:max_chars]
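The regex version leaves character entities such as `&amp;` undecoded. If you would rather not add a dependency, a middle ground is the standard library's `html.parser`. The sketch below is an alternative for illustration, not the method used in the rest of this tutorial; `html_to_text_stdlib` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and similar
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text_stdlib(html: str, max_chars: int = 12_000) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with spaces, collapse whitespace, then truncate
    return " ".join(" ".join(parser.chunks).split())[:max_chars]


page = "<p>Tom &amp; Jerry</p><script>var x = '<b>';</script><style>p{}</style><p>ok</p>"
print(html_to_text_stdlib(page))  # Tom & Jerry ok
```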

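Step 4: Call DeepSeek V4

DeepSeek V4 exposes an OpenAI-compatible chat completions endpoint, so integration is straightforward. Note that we deliberately set temperature to 0.2. Keeping this value low reduces sampling randomness, which helps the model stay close to the supplied facts, and that matters for RAG (retrieval-augmented generation) pipelines.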

    DEEPSEEK_API_KEY = os.environ["DEEPSEEK_API_KEY"]
    DEEPSEEK_MODEL = "deepseek-v4-flash"  # swap to "deepseek-v4-pro" as needed

    DEEPSEEK_HEADERS = {
        "authorization": f"Bearer {DEEPSEEK_API_KEY}",
        "content-type": "application/json",
    }


    def ask_deepseek(
        system_prompt: str,
        user_message: str,
        max_tokens: int = 1024,
        temperature: float = 0.2,
    ) -> str:
        """Send a prompt to DeepSeek V4 and return the response text."""
        response = requests.post(
            "https://api.deepseek.com/v1/chat/completions",
            json={
                "model": DEEPSEEK_MODEL,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_message},
                ],
                "max_tokens": max_tokens,
                "temperature": temperature,
            },
            headers=DEEPSEEK_HEADERS,
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
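`ask_deepseek` indexes straight into the response, which is fine for a tutorial but raises a bare `KeyError` or `IndexError` if the body has a different shape. A more defensive reader could look like the sketch below. `extract_reply` is a hypothetical helper, and the `finish_reason == "length"` check assumes the response follows the OpenAI-compatible format, where that value signals that the reply stopped at `max_tokens`.

```python
def extract_reply(response_json: dict) -> tuple:
    """Return (text, truncated) from a chat-completions style response.

    Raises ValueError with a readable message if no usable text is found.
    """
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contained no choices")
    first = choices[0]
    content = (first.get("message") or {}).get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("response contained no message content")
    # Assumption: OpenAI-compatible finish_reason semantics
    truncated = first.get("finish_reason") == "length"
    return content, truncated


sample = {"choices": [{"message": {"role": "assistant", "content": "hi"},
                       "finish_reason": "stop"}]}
print(extract_reply(sample))  # ('hi', False)
```

When `truncated` is true, raising `max_tokens` or shortening the context is usually the next thing to try.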

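Step 5: The complete agent

Now we wire the whole pipeline together. The main function takes the user's question, fetches live content through Decodo (from a specific URL, from a Google search, or both), and assembles it into a bounded context. The system prompt explicitly instructs the LLM to answer only from the live facts we just provided.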

    def web_agent(
        question: str,
        url: str | None = None,
        search_query: str | None = None,
        javascript: bool = False,
        geo: str = "China",
    ) -> str:
        """
        DeepSeek V4 agent with live web access via Decodo.

        Supply at least one of: url (scrape a specific page) or
        search_query (run a Google Search). Both can be used together.
        """
        if not url and not search_query:
            raise ValueError("Provide at least one of: url or search_query")

        context_parts: list[str] = []

        if url:
            print(f"Scraping {url} ...")
            raw = scrape_url(url, javascript=javascript, geo=geo)
            text = html_to_text(raw)
            context_parts.append(f"[Page: {url}]\n{text}")

        if search_query:
            print(f"Searching: {search_query} ...")
            results = google_search(search_query, geo=geo)
            # .get keeps a result with a missing field from raising KeyError
            formatted = "\n".join(
                f"{i + 1}. {res.get('title', '')}\n   {res.get('url', '')}\n   {res.get('desc', '')}"
                for i, res in enumerate(results[:5])
            )
            context_parts.append(f"[Search: {search_query}]\n{formatted}")

        context = "\n\n".join(context_parts)

        return ask_deepseek(
            system_prompt=(
                "You are a precise research assistant. You have been given live "
                "web content fetched right now. Answer the user's question using "
                "only the provided content. Be specific and cite facts directly."
            ),
            user_message=f"Content:\n{context}\n\nQuestion: {question}",
        )


    # Usage examples

    # 1. Scrape a specific page
    print(web_agent(
        question="What scraping plans are available and what do they cost?",
        url="https://decodo.cn/scraping/web/pricing",
    ))

    # 2. Search Google and reason over the top results
    print(web_agent(
        question="What are the key differences between DeepSeek V4-Pro and V4-Flash?",
        search_query="DeepSeek V4-Pro vs V4-Flash benchmark 2026",
    ))

    # 3. Scrape a JS-rendered page (e.g. Taobao)
    print(web_agent(
        question="What is the current listed price for this product?",
        url="https://item.taobao.com/item.htm?id=YOUR_ITEM_ID",
        javascript=True,
    ))
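`web_agent` caps the scraped page through `html_to_text`, but not the combined context. If you want one overall budget across page text and search results, something like the sketch below could sit just before the call to `ask_deepseek`. `build_context` and its `max_chars` default are hypothetical, and the budget is counted in characters rather than tokens.

```python
def build_context(parts: list, max_chars: int = 24_000, sep: str = "\n\n") -> str:
    """Join parts in order, truncating so the result never exceeds max_chars."""
    kept = []
    remaining = max_chars
    for part in parts:
        if remaining <= 0:
            break
        sep_cost = len(sep) if kept else 0
        if len(part) + sep_cost <= remaining:
            kept.append(part)
            remaining -= len(part) + sep_cost
        else:
            # Keep as much of this part as still fits, then stop
            room = remaining - sep_cost
            if room > 0:
                kept.append(part[:room])
            break
    return sep.join(kept)


print(build_context(["aaaa", "bbbb"], max_chars=7, sep="--"))  # aaaa--b
```

Earlier parts win when space runs out, so put the source you trust most first.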

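HTTP response codes

Handle these response codes explicitly in production:

200: Success. The content is in results[0]["content"].

204: The request was accepted but has not finished yet. Wait, then retry.

400: Malformed payload. Check the required fields and parameter names.

401: Invalid or missing token. Re-check DECODO_TOKEN and regenerate it if necessary.

429: Rate limit hit. Retry with exponential backoff.

524: The target site timed out. Retry; if the page needs JS rendering, enable headless: "html".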
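The list above maps naturally onto a small dispatch function. The sketch below is illustrative: `classify_status` and its action labels are hypothetical names. Note that `_post_decodo` as written retries only 429 and 524; a 204 carries no body to parse, so it would need handling before the call to `r.json()`.

```python
def classify_status(code: int) -> str:
    """Map a Decodo HTTP status code to the action suggested above."""
    if code == 200:
        return "ok"
    if code in (204, 429, 524):
        return "retry"            # wait (with backoff), then try again
    if code == 400:
        return "fix_request"      # payload problem; retrying will not help
    if code == 401:
        return "fix_credentials"  # token problem; retrying will not help
    return "unexpected"


for code in (200, 204, 400, 401, 429, 524, 500):
    print(code, classify_status(code))
```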

The complete production script

Everything in a single file. Set your environment variables, then run it.

    import json
    import os
    import re
    import time

    import requests
    from dotenv import load_dotenv, find_dotenv

    load_dotenv(find_dotenv())

    DECODO_TOKEN = os.getenv("DECODO_TOKEN")
    DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
    DEEPSEEK_MODEL = "deepseek-v4-flash"

    if not DECODO_TOKEN:
        raise ValueError("DECODO_TOKEN is missing from your .env file.")
    if not DEEPSEEK_API_KEY:
        raise ValueError("DEEPSEEK_API_KEY is missing from your .env file.")

    DECODO_HEADERS = {
        "accept": "application/json",
        "content-type": "application/json",
        "authorization": f"Basic {DECODO_TOKEN}",
    }

    DEEPSEEK_HEADERS = {
        "authorization": f"Bearer {DEEPSEEK_API_KEY}",
        "content-type": "application/json",
    }


    def _post_decodo(payload: dict, retries: int = 3, backoff: float = 2.0) -> dict:
        """Shared POST helper with exponential-backoff retry."""
        delay = backoff
        for attempt in range(retries):
            try:
                r = requests.post(
                    "https://scraper-api.decodo.com/v2/scrape",
                    json=payload,
                    headers=DECODO_HEADERS,
                    timeout=60,
                )
                r.raise_for_status()
                return r.json()
            except requests.HTTPError as exc:
                # Retry only on 429/524 while attempts remain
                if exc.response.status_code in (429, 524) and attempt < retries - 1:
                    print(f"HTTP {exc.response.status_code}, retry {attempt + 1} in {delay}s")
                    time.sleep(delay)
                    delay *= 2
                else:
                    raise


    def verify_credentials() -> bool:
        """Raises on failure; returns True once the IP endpoint responds."""
        data = _post_decodo({"url": "https://ip.decodo.com/ip"})
        ip = data["results"][0]["content"].strip()
        print(f"Decodo OK, assigned IP: {ip}")
        return True


    def scrape_url(
        url: str,
        javascript: bool = False,
        geo: str | None = None,
    ) -> str:
        payload: dict = {"url": url}
        if javascript:
            payload["headless"] = "html"
        if geo:
            payload["geo"] = geo
        return _post_decodo(payload)["results"][0]["content"]


    def google_search(
        query: str,
        geo: str = "China",
        num_pages: int = 1,
    ) -> list[dict]:
        payload = {
            "target": "google_search",
            "query": query,
            "parse": True,
            "num_pages": num_pages,
            "locale": "zh-CN",
            "geo": geo,
        }

        response_data = _post_decodo(payload)

        try:
            results_array = response_data.get("results", [{}])
            if not results_array:
                return []

            content_dict = results_array[0].get("content", {})
            inner_results = content_dict.get("results", {})
            organic_results = inner_results.get("organic", [])

            if not organic_results:
                formatted_json = json.dumps(content_dict, indent=2, ensure_ascii=False)
                print(f"⚠️ Warning: No organic results found. API returned:\n{formatted_json}")

            return organic_results

        except Exception as e:
            print(f"⚠️ Failed to parse Decodo Search JSON: {e}")
            return []


    def html_to_text(html: str, max_chars: int = 12_000) -> str:
        html = re.sub(r'<(script|style)[^>]*>.*?</(script|style)>',
                      '', html, flags=re.DOTALL | re.IGNORECASE)
        text = re.sub(r'<[^>]+>', ' ', html)
        return re.sub(r'\s+', ' ', text).strip()[:max_chars]


    def ask_deepseek(
        system_prompt: str,
        user_message: str,
        max_tokens: int = 1024,
        temperature: float = 0.2,
    ) -> str:
        r = requests.post(
            "https://api.deepseek.com/v1/chat/completions",
            json={
                "model": DEEPSEEK_MODEL,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_message},
                ],
                "max_tokens": max_tokens,
                "temperature": temperature,
            },
            headers=DEEPSEEK_HEADERS,
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]


    def web_agent(
        question: str,
        url: str | None = None,
        search_query: str | None = None,
        javascript: bool = False,
        geo: str = "China",
    ) -> str:
        if not url and not search_query:
            raise ValueError("Provide url or search_query")
        parts: list[str] = []
        if url:
            parts.append(f"[Page: {url}]\n{html_to_text(scrape_url(url, javascript, geo))}")
        if search_query:
            res = google_search(search_query, geo=geo)
            parts.append("[Search: {}]\n{}".format(
                search_query,
                "\n".join(
                    f"{i + 1}. {x.get('title', '')}\n   {x.get('url', '')}\n   {x.get('desc', '')}"
                    for i, x in enumerate(res[:5])
                ),
            ))
        context = "\n\n".join(parts)
        return ask_deepseek(
            "Answer using only the live web content below. Cite facts directly.",
            f"Content:\n{context}\n\nQuestion: {question}",
        )


    if __name__ == "__main__":
        verify_credentials()

        print(web_agent(
            question="What are the latest DeepSeek V4 benchmark results?",
            search_query="DeepSeek V4 benchmark results May 2026",
        ))

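What to do next

Add memory. Cache scrape results in a dictionary or in Redis so the same URL is not fetched twice within one session.

Add structured output. Prompt DeepSeek V4 to return JSON (price, name, date). Setting temperature: 0 makes the format more consistent from run to run.

Add multi-step reasoning. Let V4 decide the next URL to scrape based on the initial results, then loop. The million-token context window makes multi-turn chaining practical.

Scale with async. Replace requests with httpx and asyncio to run multiple Decodo scrapes concurrently, which is essential for price-monitoring pipelines that check many SKUs at once.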
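As an example of the first suggestion, here is a minimal in-memory cache sketch. `ScrapeCache` and its `ttl_seconds` parameter are hypothetical; the fetch function is injected, so in the agent you would pass something like `scrape_url`, while the demo below uses a stand-in so it runs without network access.

```python
import time
from typing import Callable


class ScrapeCache:
    """Cache fetched pages per URL for ttl_seconds (in-memory, per process)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._store = {}  # url -> (timestamp, content)

    def get(self, url: str) -> str:
        now = time.monotonic()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the network call
        content = self._fetch(url)
        self._store[url] = (now, content)
        return content


calls = []

def fake_fetch(url: str) -> str:
    # Stand-in for scrape_url so the example is self-contained
    calls.append(url)
    return f"<html>{url}</html>"

cache = ScrapeCache(fake_fetch)
cache.get("https://example.com")
cache.get("https://example.com")
print(len(calls))  # 1
```

A time-to-live matters here because the whole point of the agent is fresh data; a cache that never expires would quietly bring the staleness problem back.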
