🎯 Series navigation: This is the 17th installment in the Agents series. It focuses on evaluation systems for AI Agents, exploring how to measure an Agent's capability boundaries and real-world performance scientifically and systematically.




1. Introduction: Why Evaluate Agents? 🎯


In AI, evaluation has always been one of the core drivers of progress. From ImageNet propelling the leap in computer vision to GLUE/SuperGLUE advancing natural language understanding, every milestone has been underpinned by a scientific, systematic evaluation regime.

As large language models (LLMs) grow ever more capable, AI Agents have moved from proof of concept to real-world deployment. An Agent is no longer just a chatbot that answers questions; it can:

  • 🔧 Call tools: invoke APIs, execute code, manipulate files
  • 🌐 Browse the web: search for information, fill out forms, complete online tasks
  • 💻 Operate systems: manage files, run commands, configure environments
  • 🎮 Play games: understand rules, devise strategies, make decisions
  • 🤝 Collaborate: hold multi-turn conversations with humans or other Agents

This raises a key question: how do we know how "smart" an Agent really is?

┌─────────────────────────────────────────────────────────────────┐
│               Core Questions in Agent Evaluation                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Traditional NLP Eval          vs        Agent Eval            │
│   ────────────────────                    ──────────            │
│   • Input → output                        • Multi-step reasoning│
│   • Single task                           • Composite tasks     │
│   • Static data                           • Dynamic environments│
│   • Text matching                         • Outcome verification│
│   • Offline scoring                       • Online interaction  │
│                                                                 │
│   Question: "What is 2+2?"             Task: "Book a flight     │
│   Answer: "4" ✓                              from NYC to LA     │
│                                              for next Monday"   │
│                                        Steps: Search → Select   │
│                                              → Fill → Pay → ✓   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

💡 Question: why can't traditional NLP evaluation methods be applied to Agents directly?

🤔 Answer: traditional NLP evaluation measures the accuracy of an input-to-output mapping, whereas Agent evaluation must also account for:

  1. Soundness of the process: are the steps the Agent takes reasonable?
  2. Environment dynamics: the same task may have different optimal solutions depending on environment state
  3. Diversity of outcomes: there may be multiple correct paths to completing the same task
  4. Interaction complexity: the Agent must continuously interact with the external world
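As a toy illustration of points 2 and 3 (the checkers below are hypothetical sketches, not from any benchmark): traditional evaluation compares a string against one reference, while Agent evaluation verifies the end state of an environment and accepts any correct path to it.

```python
def nlp_exact_match(prediction: str, reference: str) -> bool:
    # Traditional eval: a single canonical answer, plain string comparison.
    return prediction.strip() == reference.strip()

def agent_task_success(env_state: dict) -> bool:
    # Agent eval: check the outcome, not the action sequence.
    # The (assumed) task here is "put a clean apple in the fridge".
    apple = env_state.get("apple", {})
    return apple.get("location") == "fridge" and apple.get("clean", False)

# Two different trajectories can end in the same successful state:
path_a = {"apple": {"location": "fridge", "clean": True}}  # washed first, then stored
path_b = {"apple": {"location": "fridge", "clean": True}}  # different step order
nlp_exact_match("4", "4")        # True
agent_task_success(path_a)       # True
agent_task_success(path_b)       # True
```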

This article takes a deep look at the core techniques of Agent evaluation, including the design philosophy behind mainstream Benchmarks, metric systems, and how to build a sound and effective evaluation scheme.


2. Core Challenges in Agent Evaluation 🔥

Before diving into the individual Benchmarks, we first need to understand the challenges that make Agent evaluation unique.

2.1 Environment Complexity

Agents must operate in complex environments, which may include:

┌─────────────────────────────────────────────────────────────────┐
│               Types of Agent Runtime Environments               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│   │ Database │   │   Web    │   │    OS    │   │   Game   │    │
│   └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘    │
│        │              │              │              │          │
│        ▼              ▼              ▼              ▼          │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                      Agent Core                          │  │
│   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐    │  │
│   │  │Perceive │→ │ Reason  │→ │  Plan   │→ │ Execute │    │  │
│   │  └─────────┘  └─────────┘  └─────────┘  └─────────┘    │  │
│   └─────────────────────────────────────────────────────────┘  │
│        │              │              │              │          │
│        ▼              ▼              ▼              ▼          │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
│   │API Tools │   │  Files   │   │ Terminal │   │   Code   │    │
│   └──────────┘   └──────────┘   └──────────┘   └──────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Each environment has its own state space, action space, and transition rules, which poses major challenges for evaluation:

  1. State-space explosion: a web environment can have millions of state combinations
  2. Unbounded action space: a natural language instruction can be phrased in unlimited ways
  3. Partial observability: the Agent usually cannot see the complete state of the environment

2.2 Ambiguity in Task Definitions

Unlike traditional tasks, Agent tasks are usually described in natural language and are inherently ambiguous:

# Examples of task descriptions
task_descriptions = {
    "simple": "Compute 2 + 2",                     # unambiguous, unique answer
    "moderate": "Find a highly rated restaurant",  # how high is "highly"? which platform?
    "complex": "Plan a trip for me",               # where? budget? dates? preferences?
    "ambiguous": "Make this code better",          # "better" by what definition?
}

💡 Question: how should the ambiguity of task definitions be handled?

🤔 Answer: common strategies include:

  1. Specification: provide a detailed task spec in the Benchmark
  2. Diversification: accept multiple reasonable ways of completing the task
  3. Interaction: allow the Agent to ask clarifying questions
  4. Decomposition: break an ambiguous task into well-defined subtasks
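Strategies 1 and 4 can be sketched with a toy example (the spec fields, subtask names, and checks below are all illustrative assumptions): the ambiguous task "Plan a trip for me" becomes an explicit spec plus subtasks that can each be verified independently.

```python
ambiguous_task = "Plan a trip for me"

task_spec = {                     # strategy 1: make the task explicit
    "destination": "Tokyo",
    "budget_usd": 2000,
    "dates": ("2024-05-01", "2024-05-07"),
}

subtasks = [                      # strategy 4: decompose into checkable steps
    {"name": "book_flight",   "done": lambda s: s.get("flight") is not None},
    {"name": "book_hotel",    "done": lambda s: s.get("hotel") is not None},
    {"name": "within_budget", "done": lambda s: s.get("total_cost", 1e9) <= task_spec["budget_usd"]},
]

# A hypothetical final environment state produced by the Agent:
state = {"flight": "NRT-123", "hotel": "Shinjuku Inn", "total_cost": 1850}
completed = [t["name"] for t in subtasks if t["done"](state)]
print(completed)  # every subtask check passes for this state
```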

2.3 Multi-dimensional Evaluation Criteria

An Agent's performance cannot be captured by a single metric; several dimensions must be considered:

┌─────────────────────────────────────────────────────────────────┐
│                Agent Evaluation Dimension Matrix                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│      Functionality       Efficiency         Reliability         │
│      ┌───────────┐      ┌───────────┐      ┌───────────┐        │
│      │  Success  │      │   Steps   │      │Consistency│        │
│      │ Accuracy  │      │   Time    │      │Robustness │        │
│      │  Quality  │      │   Cost    │      │  Safety   │        │
│      └───────────┘      └───────────┘      └───────────┘        │
│            │                  │                  │              │
│            └──────────────────┼──────────────────┘              │
│                               ▼                                 │
│                     ┌─────────────────┐                         │
│                     │  Overall Score  │                         │
│                     └─────────────────┘                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
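Combining the three dimensions into one overall score is typically a weighted average. A minimal sketch follows; the weights and the normalization choices are illustrative assumptions, not a standard from any particular benchmark.

```python
def overall_score(metrics: dict, weights: dict) -> float:
    # Each metric is assumed to be pre-normalized to [0, 1], higher is
    # better (invert cost/step counts beforehand).
    total_w = sum(weights.values())
    return sum(metrics[k] * w for k, w in weights.items()) / total_w

metrics = {
    "functionality": 0.80,   # e.g. task success rate
    "efficiency":    0.60,   # e.g. 1 - steps_used / max_steps
    "reliability":   0.90,   # e.g. agreement across repeated runs
}
weights = {"functionality": 0.5, "efficiency": 0.2, "reliability": 0.3}
print(round(overall_score(metrics, weights), 3))  # 0.79
```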

2.4 Reproducibility Challenges

Reproducibility in Agent evaluation faces several obstacles:

Challenge            Concrete problem                 Mitigation
Environment drift    Websites update, APIs change     Snapshots, simulated environments
Stochasticity        Nondeterministic LLM outputs     Average over multiple runs
Staleness            Information changes over time    Fix an evaluation time window
Resource dependence  Real API keys required           Provide mock endpoints
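The "average over multiple runs" mitigation can be sketched as follows; `run_task` is a stand-in for a real Agent rollout, and the 0.7 success probability is a made-up number for illustration.

```python
import random
import statistics

def run_task(seed: int) -> float:
    # Stand-in for one stochastic Agent rollout returning a 0/1 score.
    random.seed(seed)
    return 1.0 if random.random() < 0.7 else 0.0

# Repeat the task N times and report mean ± stdev instead of a
# single, possibly lucky, run.
scores = [run_task(seed) for seed in range(10)]
print(f"success rate: {statistics.mean(scores):.2f} "
      f"± {statistics.stdev(scores):.2f} over {len(scores)} runs")
```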

3. Deep Dive into Mainstream Benchmarks 📊

3.1 AgentBench: Multi-dimensional Comprehensive Evaluation

AgentBench is a comprehensive Agent evaluation benchmark released in 2023 by Tsinghua University and other institutions, designed to evaluate the full range of LLM capabilities when acting as Agents.

3.1.1 Design Philosophy

The core design philosophy of AgentBench is multi-environment, multi-task, real interaction:

┌─────────────────────────────────────────────────────────────────┐
│                AgentBench Architecture Overview                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                      ┌─────────────────┐                        │
│                      │   AgentBench    │                        │
│                      │   Controller    │                        │
│                      └────────┬────────┘                        │
│                               │                                 │
│         ┌─────────────────────┼─────────────────────┐          │
│         │                     │                     │          │
│         ▼                     ▼                     ▼          │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐     │
│  │  Code Env   │      │   Web Env   │      │  Game Env   │     │
│  │ ─────────── │      │ ─────────── │      │ ─────────── │     │
│  │ • OS        │      │ • Shopping  │      │ • ALFWorld  │     │
│  │ • DB        │      │ • Web Browse│      │ • SciWorld  │     │
│  │ • KG        │      │ • Mind2Web  │      │ • Jericho   │     │
│  └─────────────┘      └─────────────┘      └─────────────┘     │
│         │                     │                     │          │
│         └─────────────────────┼─────────────────────┘          │
│                               ▼                                 │
│                      ┌─────────────────┐                        │
│                      │  Unified Eval   │                        │
│                      │    Metrics      │                        │
│                      └─────────────────┘                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3.1.2 The Eight Evaluation Environments

AgentBench includes 8 carefully designed evaluation environments:

1. Operating System (OS)

# Example OS-environment task
task_os = {
    "instruction": "Find all Python files in /home that were modified "
                   "in the last 7 days and count the total lines of code.",
    "environment": "Ubuntu 20.04 Docker Container",
    "tools": ["bash", "python"],
    "expected_output_type": "integer",
    "difficulty": "medium"
}

# Possible steps the Agent might take:
# 1. find /home -name "*.py" -mtime -7
# 2. xargs wc -l
# 3. sum the totals with awk
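For illustration, the same pipeline can be written in Python; the function name and logic below are this article's sketch, not part of AgentBench.

```python
import time
from pathlib import Path

def recent_py_loc(root: str, days: int = 7) -> int:
    """Count lines in *.py files under `root` modified in the last `days` days."""
    cutoff = time.time() - days * 86400
    total = 0
    for p in Path(root).rglob("*.py"):
        if p.stat().st_mtime >= cutoff:
            # Count physical lines, mirroring `wc -l`.
            with p.open("rb") as f:
                total += sum(1 for _ in f)
    return total
```

An evaluator can then compare the integer the Agent reports against this ground-truth computation.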

2. Database (DB)

# Example DB-environment task
task_db = {
    "instruction": "Find the top 5 customers who spent the most money "
                   "in Q4 2023, including their total spending amount.",
    "database": "E-commerce SQLite DB",
    "schema": """
        customers(id, name, email, created_at)
        orders(id, customer_id, total_amount, order_date)
        order_items(id, order_id, product_id, quantity, price)
    """,
    "expected_output": "table with customer_name and total_spending"
}

# The Agent must construct the correct SQL query:
# SELECT c.name, SUM(o.total_amount) as total_spending
# FROM customers c
# JOIN orders o ON c.id = o.customer_id
# WHERE o.order_date BETWEEN '2023-10-01' AND '2023-12-31'
# GROUP BY c.id
# ORDER BY total_spending DESC
# LIMIT 5
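A query like this can be verified end-to-end against a tiny in-memory database. The schema subset and rows below are made up for illustration; a real harness would compare the Agent's result set against a gold result computed the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER,
                        total_amount REAL, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES
        (1, 1, 120.0, '2023-11-02'),
        (2, 1,  80.0, '2023-12-20'),
        (3, 2,  50.0, '2023-10-15'),
        (4, 2,  30.0, '2024-01-05');   -- outside Q4 2023, excluded
""")

rows = conn.execute("""
    SELECT c.name, SUM(o.total_amount) AS total_spending
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
    WHERE o.order_date BETWEEN '2023-10-01' AND '2023-12-31'
    GROUP BY c.id
    ORDER BY total_spending DESC
    LIMIT 5
""").fetchall()
print(rows)  # [('Ada', 200.0), ('Bob', 50.0)]
```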

3. Knowledge Graph (KG)

# Example KG-environment task
task_kg = {
    "instruction": "Who directed the movie that won Best Picture at the "
                   "Oscars in the same year that the director of 'Inception' "
                   "was born?",
    "knowledge_base": "Freebase subset",
    "query_language": "SPARQL",
    "reasoning_steps": 4
}

# Multi-hop reasoning is required:
# Step 1: Find the director of Inception (Christopher Nolan)
# Step 2: Find Nolan's birth year (1970)
# Step 3: Find the Best Picture winner for 1970 (Patton)
# Step 4: Find the director of Patton (Franklin J. Schaffner)

4. Digital Card Game (DCG)

# Example digital card game task
task_dcg = {
    "game": "Lateral Thinking Puzzle Card Game",
    "objective": "Deduce the hidden rule from given examples",
    "examples": [
        {"input": "3, 5, 7", "valid": True},
        {"input": "2, 4, 6", "valid": False},
        {"input": "9, 11, 13", "valid": True}
    ],
    "test_cases": ["1, 3, 5", "4, 8, 12", "7, 9, 11"]
}

5. Household Environment (ALFWorld)

# Example ALFWorld task
task_alfworld = {
    "instruction": "Put a clean apple in the fridge",
    "environment": "Simulated household",
    "available_actions": [
        "go to {location}",
        "take {object} from {location}",
        "put {object} in/on {location}",
        "open {container}",
        "close {container}",
        "clean {object} with {tool}",
        "examine {object}"
    ],
    "initial_state": "You are in the kitchen."
}

# Example Agent action sequence:
# > go to countertop 1
# > take apple 1 from countertop 1
# > go to sinkbasin 1
# > clean apple 1 with sinkbasin 1
# > go to fridge 1
# > open fridge 1
# > put apple 1 in fridge 1

6. Science World (ScienceWorld)

# Example ScienceWorld task
task_sciworld = {
    "instruction": "Determine which object has the highest density",
    "objects": ["wood block", "iron cube", "plastic ball"],
    "available_tools": ["scale", "ruler", "water tank"],
    "expected_method": "measure mass and volume, calculate density"
}
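The expected method above can be sketched directly; the measurement values below are made-up numbers (iron's real density is roughly 7.87 g/cm³), only the procedure matters.

```python
# Measure mass and volume, compute density = mass / volume,
# then pick the densest object.
measurements = {                  # (mass in g, volume in cm^3)
    "wood block":   (300.0, 500.0),
    "iron cube":    (787.0, 100.0),
    "plastic ball": (95.0, 100.0),
}

densities = {name: m / v for name, (m, v) in measurements.items()}
densest = max(densities, key=densities.get)
print(densest)  # iron cube
```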

7. Text Adventure Games (Jericho)

# Jericho game excerpt (Zork I)
You are standing in an open field west of a white house, 
with a boarded front door.
There is a small mailbox here.

> open mailbox
Opening the small mailbox reveals a leaflet.

> take leaflet
Taken.

> read leaflet
"WELCOME TO ZORK!
ZORK is a game of adventure, danger, and low cunning..."

8. Web Browsing (WebShop/WebArena)

This environment is covered in detail in Section 3.3.

3.1.3 Evaluation Framework Implementation

import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Any, Optional
from enum import Enum

class EnvironmentType(Enum):
    """AgentBench 支持的环境类型"""
    OS = "operating_system"
    DB = "database"
    KG = "knowledge_graph"
    DCG = "digital_card_game"
    ALFWORLD = "alfworld"
    SCIWORLD = "scienceworld"
    JERICHO = "jericho"
    WEB = "web_browsing"

@dataclass
class AgentAction:
    """Agent 动作的标准化表示"""
    action_type: str
    content: str
    timestamp: float
    metadata: Optional[Dict[str, Any]] = None

@dataclass
class EnvironmentState:
    """环境状态的标准化表示"""
    observation: str
    available_actions: List[str]
    done: bool
    reward: float
    info: Dict[str, Any]

class BaseEnvironment(ABC):
    """所有评估环境的基类"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.history: List[tuple] = []  # (action, state) pairs
        self.step_count = 0
        self.max_steps = config.get("max_steps", 30)
        
    @abstractmethod
    def reset(self, task: Dict[str, Any]) -> EnvironmentState:
        """重置环境到初始状态"""
        pass
    
    @abstractmethod
    def step(self, action: AgentAction) -> EnvironmentState:
        """执行动作并返回新状态"""
        pass
    
    @abstractmethod
    def evaluate(self) -> Dict[str, float]:
        """评估任务完成情况"""
        pass
    
    def is_done(self) -> bool:
        """检查是否终止"""
        return self.step_count >= self.max_steps

class OSEnvironment(BaseEnvironment):
    """操作系统环境实现"""
    
    def __init__(self, config: Dict[str, Any]):
        super().__init__(config)
        self.docker_client = None  # Docker client
        self.container = None
        
    def reset(self, task: Dict[str, Any]) -> EnvironmentState:
        """初始化 Docker 容器"""
        self.task = task
        self.step_count = 0
        self.history = []
        
        # Create an isolated Docker environment
        # self.container = self._create_container(task)
        
        initial_observation = f"""
You are in a Linux environment. Your task is:
{task['instruction']}

You can execute bash commands to complete this task.
Type your command:
"""
        return EnvironmentState(
            observation=initial_observation,
            available_actions=["bash command"],
            done=False,
            reward=0.0,
            info={"task_id": task.get("id")}
        )
    
    def step(self, action: AgentAction) -> EnvironmentState:
        """执行 bash 命令"""
        self.step_count += 1
        
        # Execute the command inside the container
        # result = self.container.exec_run(action.content)
        # Simulated execution result
        result = self._simulate_execution(action.content)
        
        self.history.append((action, result))
        
        return EnvironmentState(
            observation=result,
            available_actions=["bash command"],
            done=self.is_done(),
            reward=0.0,
            info={"step": self.step_count}
        )
    
    def _simulate_execution(self, command: str) -> str:
        """模拟命令执行(实际应在 Docker 中执行)"""
        return f"[Simulated output for: {command}]"
    
    def evaluate(self) -> Dict[str, float]:
        """评估任务完成情况"""
        # 检查最终输出是否符合预期
        if not self.history:
            return {"success": 0.0, "partial_score": 0.0}
        
        # Actual evaluation logic
        final_output = self.history[-1][1]
        expected = self.task.get("expected_output", "")
        
        # Simplified check (real evaluation is more involved)
        success = 1.0 if expected in final_output else 0.0
        
        return {
            "success": success,
            "steps_used": self.step_count,
            "efficiency": max(0, 1 - self.step_count / self.max_steps)
        }

class DatabaseEnvironment(BaseEnvironment):
    """数据库环境实现"""
    
    def __init__(self, config: Dict[str, Any]):
        super().__init__(config)
        self.connection = None
        
    def reset(self, task: Dict[str, Any]) -> EnvironmentState:
        """初始化数据库连接"""
        self.task = task
        self.step_count = 0
        self.history = []
        
        schema_info = task.get("schema", "No schema provided")
        
        initial_observation = f"""
You are connected to a SQL database. Your task is:
{task['instruction']}

Database Schema:
{schema_info}

Enter your SQL query:
"""
        return EnvironmentState(
            observation=initial_observation,
            available_actions=["SQL query"],
            done=False,
            reward=0.0,
            info={"task_id": task.get("id")}
        )
    
    def step(self, action: AgentAction) -> EnvironmentState:
        """执行 SQL 查询"""
        self.step_count += 1
        
        # Execute the query
        # result = self._execute_query(action.content)
        result = f"[Query result for: {action.content[:50]}...]"
        
        self.history.append((action, result))
        
        return EnvironmentState(
            observation=result,
            available_actions=["SQL query"],
            done=self.is_done(),
            reward=0.0,
            info={"step": self.step_count}
        )
    
    def evaluate(self) -> Dict[str, float]:
        """评估查询结果"""
        # 比较查询结果与标准答案
        return {"success": 0.0, "partial_score": 0.0}

class AgentBenchRunner:
    """AgentBench 评测运行器"""
    
    def __init__(self, config_path: str):
        with open(config_path) as f:
            self.config = json.load(f)
        
        self.environments: Dict[EnvironmentType, BaseEnvironment] = {}
        self._init_environments()
        
    def _init_environments(self):
        """初始化所有评估环境"""
        env_configs = self.config.get("environments", {})
        
        env_mapping = {
            EnvironmentType.OS: OSEnvironment,
            EnvironmentType.DB: DatabaseEnvironment,
            # ... other environments
        }
        
        for env_type, env_class in env_mapping.items():
            if env_type.value in env_configs:
                self.environments[env_type] = env_class(
                    env_configs[env_type.value]
                )
    
    def run_evaluation(
        self,
        agent,  # Agent instance
        env_type: EnvironmentType,
        tasks: List[Dict[str, Any]],
        num_runs: int = 3
    ) -> Dict[str, Any]:
        """运行评估"""
        
        env = self.environments.get(env_type)
        if not env:
            raise ValueError(f"Environment {env_type} not initialized")
        
        results = []
        
        for task in tasks:
            task_results = []
            
            for run in range(num_runs):
                # Reset the environment
                state = env.reset(task)
                
                # Agent interaction loop
                while not state.done:
                    # Agent decides on an action
                    action = agent.act(state.observation)
                    
                    # Environment step
                    state = env.step(action)
                
                # Evaluate
                eval_result = env.evaluate()
                task_results.append(eval_result)
            
            # Aggregate results across runs
            avg_result = self._aggregate_results(task_results)
            results.append({
                "task_id": task.get("id"),
                "results": avg_result
            })
        
        return {
            "environment": env_type.value,
            "num_tasks": len(tasks),
            "results": results,
            "aggregate": self._compute_aggregate(results)
        }
    
    def _aggregate_results(
        self, 
        results: List[Dict[str, float]]
    ) -> Dict[str, float]:
        """聚合多次运行结果"""
        if not results:
            return {}
        
        keys = results[0].keys()
        return {
            key: sum(r[key] for r in results) / len(results)
            for key in keys
        }
    
    def _compute_aggregate(
        self, 
        results: List[Dict[str, Any]]
    ) -> Dict[str, float]:
        """计算总体统计"""
        success_rates = [
            r["results"].get("success", 0) 
            for r in results
        ]
        
        return {
            "mean_success_rate": sum(success_rates) / len(success_rates),
            "num_succeeded": sum(1 for s in success_rates if s > 0.5),
            "num_total": len(results)
        }

3.1.4 Analysis of AgentBench Results

According to the evaluation results reported in the original paper, different LLMs vary widely on AgentBench:

┌─────────────────────────────────────────────────────────────────┐
│              AgentBench Results (selected models)               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Model            │  OS   │  DB   │  KG   │ ALF  │ Overall     │
│  ─────────────────────────────────────────────────────────────  │
│  GPT-4            │ 42.4  │ 32.0  │ 28.0  │ 78.0 │  4.41       │
│  GPT-3.5-turbo    │ 22.0  │ 20.0  │ 12.0  │ 15.0 │  2.26       │
│  Claude-2         │ 28.4  │ 26.0  │ 16.0  │ 22.0 │  2.74       │
│  LLaMA-2-70B      │ 8.8   │ 10.0  │ 8.0   │ 4.0  │  0.84       │
│  CodeLlama-34B    │ 12.4  │ 12.0  │ 6.0   │ 2.0  │  0.95       │
│  Vicuna-33B       │ 6.4   │ 8.0   │ 4.0   │ 2.0  │  0.59       │
│                                                                 │
│  Note: overall = weighted average; weights vary by environment  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

💡 Question: why is the gap between open-source and closed-source models so large?

🤔 Answer: the main reasons include:

  1. Instruction following: closed-source models receive more RLHF training
  2. Coding ability: Agent tasks demand substantial code generation
  3. Reasoning ability: multi-step tasks require strong reasoning chains
  4. Context length: complex tasks require processing longer histories
  5. Tool-use training: they may have been specifically trained on tool use

3.2 ToolBench: Evaluating Tool-Use Ability

ToolBench, released by Tsinghua University in 2023, is a Benchmark dedicated to evaluating the tool-calling abilities of LLMs.

3.2.1 Motivation

Calling external tools to complete tasks is one of the core capabilities of a modern AI Agent. ToolBench sets out to answer:

  • Can an LLM correctly understand API documentation?
  • Can it select the appropriate APIs for a complex task?
  • Can it construct API call parameters correctly?
  • Can it compose multiple APIs to complete a compound task?

┌─────────────────────────────────────────────────────────────────┐
│                     ToolBench Architecture                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                   Task Instructions                      │   │
│   │  "Find trending tech news and summarize the top 3"      │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Tool Retriever                        │   │
│   │  ┌─────────┐  ┌─────────┐  ┌─────────┐                  │   │
│   │  │ News API│  │Search   │  │Summary  │  ... (16k+ APIs) │   │
│   │  │         │  │API      │  │API      │                  │   │
│   │  └─────────┘  └─────────┘  └─────────┘                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    LLM Agent                             │   │
│   │  ┌─────────────────────────────────────────────────┐    │   │
│   │  │ Thought: I need to first get trending tech news │    │   │
│   │  │ Action: news_api.get_trending(category="tech")  │    │   │
│   │  │ Observation: [{"title": "AI Breakthrough",...}] │    │   │
│   │  │ Thought: Now I'll summarize the top 3...        │    │   │
│   │  └─────────────────────────────────────────────────┘    │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Evaluation                            │   │
│   │  • Pass Rate  • Win Rate  • API Accuracy               │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
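The Thought → Action → Observation loop shown in the diagram can be sketched as follows; the tool registry and the action parser here are illustrative assumptions for this article, not the actual ToolBench harness.

```python
import re

# A toy tool registry; a real one would dispatch to live APIs.
TOOLS = {
    "news_api.get_trending": lambda category: [
        {"title": "AI Breakthrough"}, {"title": "New Chip"}, {"title": "Robo Taxi"},
    ],
}

def parse_action(llm_output: str):
    # Expect a line like: Action: news_api.get_trending(category="tech")
    m = re.search(r'Action:\s*([\w.]+)\((?:\w+="([^"]*)")?\)', llm_output)
    return (m.group(1), m.group(2)) if m else (None, None)

llm_output = ('Thought: I need to first get trending tech news\n'
              'Action: news_api.get_trending(category="tech")')
tool, arg = parse_action(llm_output)
observation = TOOLS[tool](arg) if tool in TOOLS else "unknown tool"
print(tool, len(observation))  # news_api.get_trending 3
```

The observation would then be appended to the prompt, and the loop repeats until the Agent emits a final answer.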

3.2.2 Dataset Construction

The ToolBench dataset construction pipeline:

from typing import Dict, List, Any, Optional
from dataclasses import dataclass, field
import json

@dataclass
class APISchema:
    """API 接口的标准化表示"""
    name: str
    description: str
    category: str
    base_url: str
    endpoints: List[Dict[str, Any]]
    authentication: Optional[Dict[str, str]] = None
    rate_limit: Optional[int] = None

@dataclass
class ToolBenchTask:
    """ToolBench 任务定义"""
    task_id: str
    instruction: str
    category: str  # single tool / multi-tool single-category / multi-tool multi-category
    required_apis: List[str]
    difficulty: str  # easy / medium / hard
    ground_truth_path: Optional[List[Dict]] = None
    
    # Metadata
    metadata: Dict[str, Any] = field(default_factory=dict)

class ToolBenchDatasetBuilder:
    """ToolBench 数据集构建器"""
    
    def __init__(self, rapidapi_key: str):
        self.api_key = rapidapi_key
        self.api_pool: Dict[str, APISchema] = {}
        self.tasks: List[ToolBenchTask] = []
        
    def crawl_apis(self, categories: List[str]) -> None:
        """从 RapidAPI 爬取 API 信息"""
        
        for category in categories:
            # Crawl all APIs in this category
            apis = self._fetch_apis_by_category(category)
            
            for api in apis:
                schema = self._parse_api_schema(api)
                if self._is_valid_api(schema):
                    self.api_pool[schema.name] = schema
        
        print(f"Collected {len(self.api_pool)} valid APIs")
    
    def _fetch_apis_by_category(self, category: str) -> List[Dict]:
        """获取指定类别的 API 列表"""
        # 实际实现会调用 RapidAPI
        return []
    
    def _parse_api_schema(self, raw_api: Dict) -> APISchema:
        """解析 API 原始数据为标准 schema"""
        return APISchema(
            name=raw_api.get("name", ""),
            description=raw_api.get("description", ""),
            category=raw_api.get("category", ""),
            base_url=raw_api.get("base_url", ""),
            endpoints=raw_api.get("endpoints", [])
        )
    
    def _is_valid_api(self, schema: APISchema) -> bool:
        """验证 API 是否可用"""
        # 检查必要字段
        if not schema.name or not schema.endpoints:
            return False
        # Check that documentation exists
        if not schema.description:
            return False
        return True
    
    def generate_tasks(
        self, 
        task_types: List[str],
        num_tasks_per_type: int = 100
    ) -> None:
        """使用 LLM 生成评测任务"""
        
        for task_type in task_types:
            if task_type == "single_tool":
                self._generate_single_tool_tasks(num_tasks_per_type)
            elif task_type == "multi_tool_single_category":
                self._generate_multi_tool_single_cat_tasks(num_tasks_per_type)
            elif task_type == "multi_tool_multi_category":
                self._generate_multi_tool_multi_cat_tasks(num_tasks_per_type)
    
    def _generate_single_tool_tasks(self, num_tasks: int) -> None:
        """生成单工具任务"""
        
        prompt_template = """
Given the following API documentation, generate a realistic user 
query that would require using this API to answer.

API Name: {api_name}
Description: {api_description}
Endpoints: {endpoints}

Requirements:
1. The query should be natural and specific
2. It should require calling at least one endpoint
3. Provide difficulty level (easy/medium/hard)

Output format:
{{
    "instruction": "...",
    "required_endpoints": [...],
    "difficulty": "..."
}}
"""
        
        for api_name, api_schema in list(self.api_pool.items())[:num_tasks]:
            # Call an LLM to generate the task
            # task_json = self._call_llm(prompt_template.format(...))
            
            task = ToolBenchTask(
                task_id=f"single_{api_name}_{len(self.tasks)}",
                instruction="[Generated instruction]",
                category="single_tool",
                required_apis=[api_name],
                difficulty="medium"
            )
            self.tasks.append(task)
    
    def _generate_multi_tool_single_cat_tasks(self, num_tasks: int) -> None:
        """生成多工具单类别任务"""
        # 选择同一类别的多个 API 组合
        pass
    
    def _generate_multi_tool_multi_cat_tasks(self, num_tasks: int) -> None:
        """生成多工具多类别任务"""
        # 选择不同类别的多个 API 组合
        pass
    
    def export_dataset(self, output_path: str) -> None:
        """导出数据集"""
        dataset = {
            "api_pool": {
                name: {
                    "name": schema.name,
                    "description": schema.description,
                    "category": schema.category,
                    "endpoints": schema.endpoints
                }
                for name, schema in self.api_pool.items()
            },
            "tasks": [
                {
                    "task_id": task.task_id,
                    "instruction": task.instruction,
                    "category": task.category,
                    "required_apis": task.required_apis,
                    "difficulty": task.difficulty
                }
                for task in self.tasks
            ]
        }
        
        with open(output_path, 'w') as f:
            json.dump(dataset, f, indent=2)

3.2.3 Evaluation Procedure and Metrics

from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass
import re

@dataclass
class ToolCallResult:
    """工具调用结果"""
    tool_name: str
    endpoint: str
    parameters: Dict[str, Any]
    response: Any
    success: bool
    error_message: Optional[str] = None

@dataclass
class EvaluationResult:
    """评估结果"""
    task_id: str
    success: bool
    api_calls: List[ToolCallResult]
    final_answer: str
    metrics: Dict[str, float]

class ToolBenchEvaluator:
    """ToolBench 评估器"""
    
    def __init__(self, api_pool: Dict[str, APISchema]):
        self.api_pool = api_pool
        self.results: List[EvaluationResult] = []
        
    def evaluate_response(
        self,
        task: ToolBenchTask,
        agent_response: str,
        execution_trace: List[ToolCallResult]
    ) -> EvaluationResult:
        """评估单个任务的响应"""
        
        metrics = {}
        
        # 1. Pass rate (was the task completed successfully?)
        metrics["pass"] = self._compute_pass_rate(
            task, agent_response, execution_trace
        )
        
        # 2. API selection accuracy
        metrics["api_selection_accuracy"] = self._compute_api_accuracy(
            task.required_apis,
            [call.tool_name for call in execution_trace]
        )
        
        # 3. Parameter accuracy
        metrics["parameter_accuracy"] = self._compute_param_accuracy(
            execution_trace
        )
        
        # 4. Efficiency (number of calls)
        metrics["efficiency"] = self._compute_efficiency(
            task, execution_trace
        )
        
        result = EvaluationResult(
            task_id=task.task_id,
            success=metrics["pass"] > 0.5,
            api_calls=execution_trace,
            final_answer=agent_response,
            metrics=metrics
        )
        
        self.results.append(result)
        return result
    
    def _compute_pass_rate(
        self,
        task: ToolBenchTask,
        response: str,
        trace: List[ToolCallResult]
    ) -> float:
        """计算任务通过率"""
        
        # Check that every required API was called
        called_apis = set(call.tool_name for call in trace if call.success)
        required_apis = set(task.required_apis)
        
        if not required_apis.issubset(called_apis):
            return 0.0
        
        # Check that a successful final response exists
        if not response or "error" in response.lower():
            return 0.0
        
        # More sophisticated answer verification could be added here
        return 1.0
    
    def _compute_api_accuracy(
        self,
        required: List[str],
        called: List[str]
    ) -> float:
        """计算 API 选择准确率"""
        
        if not required:
            return 1.0 if not called else 0.0
        
        required_set = set(required)
        called_set = set(called)
        
        # Precision: fraction of called APIs that were required
        precision = len(required_set & called_set) / len(called_set) \
                    if called_set else 0
        
        # Recall: fraction of required APIs that were called
        recall = len(required_set & called_set) / len(required_set)
        
        # F1 Score
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)
    
    def _compute_param_accuracy(
        self,
        trace: List[ToolCallResult]
    ) -> float:
        """计算参数正确率"""
        
        if not trace:
            return 0.0
        
        correct_calls = sum(1 for call in trace if call.success)
        return correct_calls / len(trace)
    
    def _compute_efficiency(
        self,
        task: ToolBenchTask,
        trace: List[ToolCallResult]
    ) -> float:
        """计算效率分数"""
        
        # 假设最优解的调用次数:每个必需 API 恰好调用一次
        optimal_calls = len(task.required_apis)
        actual_calls = len(trace)
        
        if actual_calls == 0:
            return 0.0
        
        # 效率分数:最优 / 实际(上限为 1)
        return min(1.0, optimal_calls / actual_calls)
    
    def compute_aggregate_metrics(self) -> Dict[str, float]:
        """计算聚合指标"""
        
        if not self.results:
            return {}
        
        metrics = {
            "overall_pass_rate": sum(
                r.metrics["pass"] for r in self.results
            ) / len(self.results),
            
            "avg_api_accuracy": sum(
                r.metrics["api_selection_accuracy"] for r in self.results
            ) / len(self.results),
            
            "avg_param_accuracy": sum(
                r.metrics["parameter_accuracy"] for r in self.results
            ) / len(self.results),
            
            "avg_efficiency": sum(
                r.metrics["efficiency"] for r in self.results
            ) / len(self.results),
            
            "success_count": sum(1 for r in self.results if r.success),
            "total_count": len(self.results)
        }
        
        return metrics
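
上面的指标计算可以脱离评估器类单独验证。下面是一个与 `_compute_api_accuracy`、`_compute_efficiency` 逻辑一致的最小示例(独立函数,函数名为演示所取,仅用于说明 F1 与效率分数的计算方式):

```python
def api_f1(required: set, called: set) -> float:
    """API 选择 F1:与上文 _compute_api_accuracy 的逻辑一致"""
    if not required:
        return 1.0 if not called else 0.0
    precision = len(required & called) / len(called) if called else 0.0
    recall = len(required & called) / len(required)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def efficiency(optimal_calls: int, actual_calls: int) -> float:
    """效率分数:最优调用次数 / 实际调用次数,上限为 1"""
    if actual_calls == 0:
        return 0.0
    return min(1.0, optimal_calls / actual_calls)

# 必需 {search, book},实际调用了 {search, book, weather}:
# precision = 2/3,recall = 1,F1 ≈ 0.8;多调了 1 次,效率 = 2/3
score = api_f1({"search", "book"}, {"search", "book", "weather"})
eff = efficiency(2, 3)
```

注意 F1 同时惩罚冗余调用(降低 precision)和漏掉的必需调用(降低 recall),比单纯的命中率更能反映 API 选择质量。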

class ToolLLMAgent:
    """ToolLLM Agent 实现"""
    
    def __init__(
        self,
        model_name: str,
        api_pool: Dict[str, APISchema],
        retriever = None
    ):
        self.model_name = model_name
        self.api_pool = api_pool
        self.retriever = retriever
        
    def solve_task(
        self,
        task: ToolBenchTask
    ) -> Tuple[str, List[ToolCallResult]]:
        """解决任务"""
        
        # 1. 检索相关 API
        relevant_apis = self._retrieve_apis(task.instruction)
        
        # 2. 构建 prompt
        prompt = self._build_prompt(task.instruction, relevant_apis)
        
        # 3. 迭代执行
        execution_trace = []
        max_iterations = 10
        
        for _ in range(max_iterations):
            # 调用 LLM 获取下一步动作
            response = self._call_llm(prompt)
            
            # 解析动作
            action = self._parse_action(response)
            
            if action["type"] == "finish":
                return action["answer"], execution_trace
            
            elif action["type"] == "api_call":
                # 执行 API 调用
                result = self._execute_api_call(
                    action["api_name"],
                    action["endpoint"],
                    action["parameters"]
                )
                execution_trace.append(result)
                
                # 更新 prompt
                prompt += f"\nObservation: {result.response}"
        
        return "Task incomplete", execution_trace
    
    def _retrieve_apis(
        self,
        instruction: str,
        top_k: int = 5
    ) -> List[APISchema]:
        """检索相关 API"""
        
        if self.retriever:
            return self.retriever.retrieve(instruction, top_k)
        
        # 简单关键词匹配(实际应使用向量检索)
        scores = []
        for name, schema in self.api_pool.items():
            score = sum(
                1 for word in instruction.lower().split()
                if word in schema.description.lower()
            )
            scores.append((schema, score))
        
        scores.sort(key=lambda x: x[1], reverse=True)
        return [s[0] for s in scores[:top_k]]
    
    def _build_prompt(
        self,
        instruction: str,
        apis: List[APISchema]
    ) -> str:
        """构建 ReAct 风格的 prompt"""
        
        api_docs = "\n\n".join([
            f"API: {api.name}\n"
            f"Description: {api.description}\n"
            f"Endpoints: {json.dumps(api.endpoints, indent=2)}"
            for api in apis
        ])
        
        prompt = f"""You are a helpful assistant that can use APIs to help 
users complete tasks.

Available APIs:
{api_docs}

User Request: {instruction}

Please solve this task step by step. At each step, you can:
1. Think about what to do next
2. Call an API using the format:
   Action: api_name.endpoint(param1=value1, param2=value2)
3. Or provide the final answer:
   Action: Finish(answer="your final answer")

Let's begin:

Thought: """
        
        return prompt
    
    def _call_llm(self, prompt: str) -> str:
        """调用 LLM"""
        # 实际实现会调用 OpenAI/Anthropic 等 API
        return "[LLM Response]"
    
    def _parse_action(self, response: str) -> Dict[str, Any]:
        """解析 LLM 响应中的动作"""
        
        # 匹配 Finish 动作
        finish_match = re.search(
            r'Finish\(answer="(.+?)"\)', response
        )
        if finish_match:
            return {
                "type": "finish",
                "answer": finish_match.group(1)
            }
        
        # 匹配 API 调用
        api_match = re.search(
            r'Action:\s*(\w+)\.(\w+)\((.+?)\)', response
        )
        if api_match:
            api_name = api_match.group(1)
            endpoint = api_match.group(2)
            params_str = api_match.group(3)
            
            # 解析参数
            params = self._parse_parameters(params_str)
            
            return {
                "type": "api_call",
                "api_name": api_name,
                "endpoint": endpoint,
                "parameters": params
            }
        
        return {"type": "unknown"}
    
    def _parse_parameters(self, params_str: str) -> Dict[str, Any]:
        """解析参数字符串"""
        params = {}
        # 简化实现
        for param in params_str.split(","):
            if "=" in param:
                key, value = param.split("=", 1)
                params[key.strip()] = value.strip().strip('"\'')
        return params
    
    def _execute_api_call(
        self,
        api_name: str,
        endpoint: str,
        parameters: Dict[str, Any]
    ) -> ToolCallResult:
        """执行 API 调用"""
        
        # 实际实现会调用真实 API
        return ToolCallResult(
            tool_name=api_name,
            endpoint=endpoint,
            parameters=parameters,
            response={"status": "success", "data": "..."},
            success=True
        )
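
ToolLLMAgent 依赖正则表达式从 LLM 输出中解析动作。这段解析逻辑可以抽成独立函数单独测试,下面的 `parse_react_action` 为演示用的函数名,逻辑与上文 `_parse_action` / `_parse_parameters` 的简化实现一致(所有参数值均按字符串处理):

```python
import re
from typing import Any, Dict

def parse_react_action(response: str) -> Dict[str, Any]:
    """解析 ReAct 风格输出:先匹配 Finish,再匹配 API 调用"""
    finish = re.search(r'Finish\(answer="(.+?)"\)', response)
    if finish:
        return {"type": "finish", "answer": finish.group(1)}
    api = re.search(r'Action:\s*(\w+)\.(\w+)\((.+?)\)', response)
    if api:
        params: Dict[str, Any] = {}
        for part in api.group(3).split(","):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip().strip('"\'')
        return {"type": "api_call", "api_name": api.group(1),
                "endpoint": api.group(2), "parameters": params}
    return {"type": "unknown"}

# 示例:解析一条 ReAct 格式的 API 调用
action = parse_react_action('Action: weather.get_forecast(city="Paris", days=3)')
```

这种按逗号切分的解析方式无法处理值中含逗号或嵌套括号的情况,生产实现通常会改用 JSON 格式的函数调用输出。
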

3.2.4 ToolBench 关键发现

研究发现了以下重要结论:

┌─────────────────────────────────────────────────────────────────┐
│                ToolBench 关键发现                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. API 检索是关键瓶颈                                           │
│     ┌────────────────────────────────────────────┐              │
│     │  正确 API 检索    → Pass Rate ≈ 65%        │              │
│     │  错误 API 检索    → Pass Rate ≈ 12%        │              │
│     └────────────────────────────────────────────┘              │
│                                                                 │
│  2. 任务复杂度影响显著                                           │
│     ┌────────────────────────────────────────────┐              │
│     │  单工具任务        Pass Rate: 71.2%        │              │
│     │  多工具单类别      Pass Rate: 48.5%        │              │
│     │  多工具多类别      Pass Rate: 32.1%        │              │
│     └────────────────────────────────────────────┘              │
│                                                                 │
│  3. 参数构造是主要错误来源                                       │
│     ┌────────────────────────────────────────────┐              │
│     │  错误类型          占比                     │              │
│     │  ────────────────────────                  │              │
│     │  参数格式错误       38%                     │              │
│     │  参数缺失          25%                     │              │
│     │  API 选择错误      22%                     │              │
│     │  逻辑错误          15%                     │              │
│     └────────────────────────────────────────────┘              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3.3 WebArena:真实网页交互评测

WebArena 是由卡内基梅隆大学于 2023 年发布的网页 Agent 评估基准,专注于评估 Agent 在真实网页环境中执行复杂任务的能力。

3.3.1 设计特点

WebArena 的核心创新在于提供了完全可复现的真实网页环境:

┌─────────────────────────────────────────────────────────────────┐
│                    WebArena 环境架构                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Docker Compose                        │   │
│   │  ┌──────────────────────────────────────────────────┐   │   │
│   │  │              Self-hosted Websites                 │   │   │
│   │  │                                                   │   │   │
│   │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐       │   │   │
│   │  │  │ Shopping │  │   CMS    │  │  Reddit  │       │   │   │
│   │  │  │(OneStop- │  │(Postmill)│  │  Clone   │       │   │   │
│   │  │  │  Shop)   │  │          │  │          │       │   │   │
│   │  │  └──────────┘  └──────────┘  └──────────┘       │   │   │
│   │  │                                                   │   │   │
│   │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐       │   │   │
│   │  │  │  GitLab  │  │Wikipedia │  │   Map    │       │   │   │
│   │  │  │  Clone   │  │  Clone   │  │  (OSM)   │       │   │   │
│   │  │  │          │  │          │  │          │       │   │   │
│   │  │  └──────────┘  └──────────┘  └──────────┘       │   │   │
│   │  │                                                   │   │   │
│   │  └──────────────────────────────────────────────────┘   │   │
│   │                          │                              │   │
│   │                          ▼                              │   │
│   │  ┌──────────────────────────────────────────────────┐   │   │
│   │  │              Playwright Controller                │   │   │
│   │  │  • Browser Automation                            │   │   │
│   │  │  • State Snapshot                                │   │   │
│   │  │  • Action Execution                              │   │   │
│   │  └──────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Agent Interface                       │   │
│   │  • Observation: Accessibility Tree / Screenshot / HTML  │   │
│   │  • Action Space: click, type, scroll, select, etc.      │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3.3.2 网站环境详解

WebArena 包含 6 个功能完整的网站:

1. Shopping(电商)- OneStopShop

shopping_tasks = [
    {
        "id": "shopping_001",
        "instruction": "Find a red dress under $50 with at least 4-star "
                       "rating and add the cheapest one to cart",
        "site": "shopping",
        "eval_type": "state_check",
        "expected_state": {
            "cart_items": [
                {"category": "dress", "color": "red", "price": "<50"}
            ]
        }
    },
    {
        "id": "shopping_002",
        "instruction": "Find the order I placed last month and request "
                       "a return for the damaged item",
        "site": "shopping",
        "eval_type": "action_sequence",
        "required_actions": ["login", "orders", "select_order", "return"]
    }
]

2. CMS(内容管理)- Postmill

cms_tasks = [
    {
        "id": "cms_001",
        "instruction": "Create a new post in the 'Technology' forum with "
                       "title 'AI Updates' and body 'Latest AI news'",
        "site": "cms",
        "eval_type": "content_check",
        "expected_content": {
            "post_exists": True,
            "title_match": "AI Updates",
            "forum": "Technology"
        }
    }
]

3. Reddit Clone

reddit_tasks = [
    {
        "id": "reddit_001",
        "instruction": "Find the most upvoted post this week in r/programming "
                       "and leave a comment",
        "site": "reddit",
        "eval_type": "combined"
    }
]

4. GitLab Clone

gitlab_tasks = [
    {
        "id": "gitlab_001",
        "instruction": "Create a new issue in project 'web-app' titled "
                       "'Bug: Login fails' with label 'bug' and assign to me",
        "site": "gitlab",
        "eval_type": "api_verify"
    },
    {
        "id": "gitlab_002",
        "instruction": "Find all open merge requests that have been pending "
                       "for more than a week and add a comment requesting review",
        "site": "gitlab",
        "eval_type": "multi_step"
    }
]

5. Wikipedia Clone

wiki_tasks = [
    {
        "id": "wiki_001",
        "instruction": "Find the population of France according to the "
                       "Wikipedia article and report the number",
        "site": "wikipedia",
        "eval_type": "string_match"
    }
]

6. Map(OpenStreetMap)

map_tasks = [
    {
        "id": "map_001",
        "instruction": "Find the shortest driving route from CMU to "
                       "Pittsburgh Airport and tell me the estimated time",
        "site": "map",
        "eval_type": "fuzzy_match"
    }
]
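
上述六类任务共享同一组字段(`id`、`instruction`、`site`、`eval_type`)。加载任务集时可以先做一次字段校验并按站点分组,下面是一个示意实现(`group_tasks_by_site` 为演示用的假设函数名,字段名沿用上文任务定义):

```python
from collections import defaultdict
from typing import Any, Dict, List

REQUIRED_FIELDS = {"id", "instruction", "site", "eval_type"}

def group_tasks_by_site(
    tasks: List[Dict[str, Any]]
) -> Dict[str, List[Dict[str, Any]]]:
    """校验公共字段并按站点分组,方便逐站点批量评测"""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for task in tasks:
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(f"task {task.get('id', '?')} missing fields: {missing}")
        grouped[task["site"]].append(task)
    return dict(grouped)
```
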

3.3.3 Agent 实现

from playwright.async_api import async_playwright, Page, Browser
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass
from enum import Enum
import asyncio
import base64

class ActionType(Enum):
    """WebArena 支持的动作类型"""
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    SELECT = "select"
    HOVER = "hover"
    PRESS = "press"
    GO_BACK = "go_back"
    GO_FORWARD = "go_forward"
    GOTO = "goto"
    STOP = "stop"  # 任务完成

@dataclass
class WebAction:
    """网页动作表示"""
    action_type: ActionType
    element_id: Optional[str] = None
    text: Optional[str] = None
    url: Optional[str] = None
    direction: Optional[str] = None  # up/down for scroll
    key: Optional[str] = None  # for press action

@dataclass
class WebObservation:
    """网页观察状态"""
    url: str
    title: str
    accessibility_tree: str
    screenshot_base64: Optional[str] = None
    html_content: Optional[str] = None
    interactive_elements: Optional[List[Dict[str, Any]]] = None

class WebArenaEnvironment:
    """WebArena 环境实现"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.browser: Optional[Browser] = None
        self.page: Optional[Page] = None
        self.step_count = 0
        self.max_steps = config.get("max_steps", 30)
        self.history: List[Tuple[WebAction, WebObservation]] = []
        
    async def setup(self) -> None:
        """初始化浏览器"""
        # 保存 playwright 句柄,以便 cleanup 时停止
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(
            headless=self.config.get("headless", True)
        )
        self.page = await self.browser.new_page()
        
        # 设置视口
        await self.page.set_viewport_size({
            "width": 1280,
            "height": 720
        })
        
    async def reset(self, task: Dict[str, Any]) -> WebObservation:
        """重置环境到任务初始状态"""
        self.task = task
        self.step_count = 0
        self.history = []
        
        # 导航到初始页面
        start_url = self._get_start_url(task["site"])
        await self.page.goto(start_url)
        
        # 如果需要登录
        if task.get("requires_login", False):
            await self._perform_login(task["site"])
        
        return await self._get_observation()
    
    def _get_start_url(self, site: str) -> str:
        """获取站点起始 URL"""
        site_urls = {
            "shopping": "http://localhost:7770",
            "cms": "http://localhost:7771",
            "reddit": "http://localhost:7772",
            "gitlab": "http://localhost:7773",
            "wikipedia": "http://localhost:7774",
            "map": "http://localhost:7775"
        }
        return site_urls.get(site, "http://localhost:7770")
    
    async def _perform_login(self, site: str) -> None:
        """执行登录操作"""
        credentials = self.config.get("credentials", {}).get(site, {})
        # 实际登录逻辑...
        pass
    
    async def _get_observation(self) -> WebObservation:
        """获取当前页面观察"""
        
        # 获取可访问性树
        accessibility_tree = await self._get_accessibility_tree()
        
        # 获取截图(可选)
        screenshot = None
        if self.config.get("include_screenshot", False):
            screenshot_bytes = await self.page.screenshot()
            screenshot = base64.b64encode(screenshot_bytes).decode()
        
        # 获取交互元素
        interactive_elements = await self._get_interactive_elements()
        
        return WebObservation(
            url=self.page.url,
            title=await self.page.title(),
            accessibility_tree=accessibility_tree,
            screenshot_base64=screenshot,
            interactive_elements=interactive_elements
        )
    
    async def _get_accessibility_tree(self) -> str:
        """获取页面可访问性树"""
        
        # 使用 Playwright 获取可访问性快照
        snapshot = await self.page.accessibility.snapshot()
        
        def format_node(node: Dict, indent: int = 0) -> str:
            """递归格式化节点"""
            lines = []
            prefix = "  " * indent
            
            role = node.get("role", "")
            name = node.get("name", "")
            node_id = node.get("id", "")
            
            line = f"{prefix}[{node_id}] {role}"
            if name:
                line += f' "{name}"'
            
            # 添加额外属性
            if node.get("focused"):
                line += " (focused)"
            if node.get("disabled"):
                line += " (disabled)"
            
            lines.append(line)
            
            # 递归处理子节点
            for child in node.get("children", []):
                lines.append(format_node(child, indent + 1))
            
            return "\n".join(lines)
        
        return format_node(snapshot) if snapshot else ""
    
    async def _get_interactive_elements(self) -> List[Dict[str, Any]]:
        """获取页面上的交互元素"""
        
        elements = await self.page.evaluate("""
            () => {
                const interactiveSelectors = [
                    'a', 'button', 'input', 'select', 'textarea',
                    '[role="button"]', '[role="link"]', '[role="textbox"]',
                    '[onclick]', '[tabindex]'
                ];
                
                const elements = [];
                
                interactiveSelectors.forEach(selector => {
                    document.querySelectorAll(selector).forEach((el, idx) => {
                        const rect = el.getBoundingClientRect();
                        if (rect.width > 0 && rect.height > 0) {
                            elements.push({
                                id: `elem_${elements.length}`,
                                tag: el.tagName.toLowerCase(),
                                text: el.innerText?.slice(0, 100) || '',
                                placeholder: el.placeholder || '',
                                type: el.type || '',
                                href: el.href || '',
                                x: rect.x,
                                y: rect.y,
                                width: rect.width,
                                height: rect.height
                            });
                        }
                    });
                });
                
                return elements;
            }
        """)
        
        return elements
    
    async def step(self, action: WebAction) -> WebObservation:
        """执行动作"""
        self.step_count += 1
        
        try:
            if action.action_type == ActionType.CLICK:
                await self._execute_click(action)
            elif action.action_type == ActionType.TYPE:
                await self._execute_type(action)
            elif action.action_type == ActionType.SCROLL:
                await self._execute_scroll(action)
            elif action.action_type == ActionType.SELECT:
                await self._execute_select(action)
            elif action.action_type == ActionType.HOVER:
                await self._execute_hover(action)
            elif action.action_type == ActionType.PRESS:
                await self._execute_press(action)
            elif action.action_type == ActionType.GO_BACK:
                await self.page.go_back()
            elif action.action_type == ActionType.GO_FORWARD:
                await self.page.go_forward()
            elif action.action_type == ActionType.GOTO:
                await self.page.goto(action.url)
        except Exception as e:
            print(f"Action failed: {e}")
        
        observation = await self._get_observation()
        self.history.append((action, observation))
        
        return observation
    
    async def _execute_click(self, action: WebAction) -> None:
        """执行点击"""
        if action.element_id:
            # 通过元素 ID 定位
            element = await self._find_element_by_id(action.element_id)
            if element:
                await element.click()
    
    async def _execute_type(self, action: WebAction) -> None:
        """执行输入"""
        if action.element_id and action.text:
            element = await self._find_element_by_id(action.element_id)
            if element:
                await element.fill(action.text)
    
    async def _execute_scroll(self, action: WebAction) -> None:
        """执行滚动"""
        direction = action.direction or "down"
        delta = 500 if direction == "down" else -500
        await self.page.mouse.wheel(0, delta)
    
    async def _execute_select(self, action: WebAction) -> None:
        """执行选择"""
        if action.element_id and action.text:
            element = await self._find_element_by_id(action.element_id)
            if element:
                await element.select_option(label=action.text)
    
    async def _execute_hover(self, action: WebAction) -> None:
        """执行悬停"""
        if action.element_id:
            element = await self._find_element_by_id(action.element_id)
            if element:
                await element.hover()
    
    async def _execute_press(self, action: WebAction) -> None:
        """执行按键"""
        if action.key:
            await self.page.keyboard.press(action.key)
    
    async def _find_element_by_id(self, element_id: str):
        """通过 ID 找到元素"""
        # 实际实现会更复杂
        return None
    
    def is_done(self) -> bool:
        """检查是否完成"""
        return self.step_count >= self.max_steps
    
    async def cleanup(self) -> None:
        """清理资源"""
        if self.browser:
            await self.browser.close()
        # 若 setup 中保存了 playwright 句柄,一并停止
        if getattr(self, "playwright", None):
            await self.playwright.stop()

class WebArenaAgent:
    """WebArena Agent 实现"""
    
    def __init__(
        self,
        llm_client,
        observation_mode: str = "accessibility_tree"  # or "screenshot"
    ):
        self.llm = llm_client
        self.observation_mode = observation_mode
        
    async def act(
        self,
        task: Dict[str, Any],
        observation: WebObservation
    ) -> WebAction:
        """基于观察生成动作"""
        
        prompt = self._build_prompt(task, observation)
        
        response = await self.llm.generate(prompt)
        
        action = self._parse_action(response)
        
        return action
    
    def _build_prompt(
        self,
        task: Dict[str, Any],
        observation: WebObservation
    ) -> str:
        """构建 prompt"""
        
        prompt = f"""You are a web navigation agent. Your task is to help 
the user complete their request by interacting with a web browser.

## Task
{task['instruction']}

## Current Page
URL: {observation.url}
Title: {observation.title}

## Page Content (Accessibility Tree)
{observation.accessibility_tree}

## Available Actions
- click(element_id): Click on an element
- type(element_id, text): Type text into an input field  
- scroll(direction): Scroll up or down
- select(element_id, option): Select an option from dropdown
- hover(element_id): Hover over an element
- press(key): Press a keyboard key
- go_back(): Go to previous page
- go_forward(): Go to next page
- goto(url): Navigate to a URL
- stop(answer): Task completed, provide final answer if needed

## Rules
1. Analyze the current page state carefully
2. Identify the next logical step to complete the task
3. Output exactly ONE action in the format: action_name(params)
4. If the task is complete, use stop(answer)

## Your Response
Think step by step, then provide your action:
"""
        return prompt
    
    def _parse_action(self, response: str) -> WebAction:
        """解析 LLM 响应为动作"""
        
        import re
        
        # 匹配各种动作格式
        patterns = {
            "click": r"click\(([^)]+)\)",
            "type": r"type\(([^,]+),\s*['\"](.+?)['\"]\)",
            "scroll": r"scroll\((up|down)\)",
            "select": r"select\(([^,]+),\s*['\"](.+?)['\"]\)",
            "hover": r"hover\(([^)]+)\)",
            "press": r"press\(['\"](.+?)['\"]\)",
            "go_back": r"go_back\(\)",
            "go_forward": r"go_forward\(\)",
            "goto": r"goto\(['\"](.+?)['\"]\)",
            "stop": r"stop\((?:['\"](.+?)['\"])?\)"
        }
        
        for action_name, pattern in patterns.items():
            match = re.search(pattern, response, re.IGNORECASE)
            if match:
                if action_name == "click":
                    return WebAction(
                        action_type=ActionType.CLICK,
                        element_id=match.group(1).strip("'\"")
                    )
                elif action_name == "type":
                    return WebAction(
                        action_type=ActionType.TYPE,
                        element_id=match.group(1).strip("'\""),
                        text=match.group(2)
                    )
                elif action_name == "scroll":
                    return WebAction(
                        action_type=ActionType.SCROLL,
                        direction=match.group(1)
                    )
                elif action_name == "select":
                    return WebAction(
                        action_type=ActionType.SELECT,
                        element_id=match.group(1).strip("'\""),
                        text=match.group(2)
                    )
                elif action_name == "hover":
                    return WebAction(
                        action_type=ActionType.HOVER,
                        element_id=match.group(1).strip("'\"")
                    )
                elif action_name == "press":
                    return WebAction(
                        action_type=ActionType.PRESS,
                        key=match.group(1)
                    )
                elif action_name == "go_back":
                    return WebAction(action_type=ActionType.GO_BACK)
                elif action_name == "go_forward":
                    return WebAction(action_type=ActionType.GO_FORWARD)
                elif action_name == "goto":
                    return WebAction(
                        action_type=ActionType.GOTO,
                        url=match.group(1)
                    )
                elif action_name == "stop":
                    return WebAction(
                        action_type=ActionType.STOP,
                        text=match.group(1) if match.group(1) else ""
                    )
        
        # 默认返回停止动作
        return WebAction(action_type=ActionType.STOP)
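
WebArenaAgent 与 WebArenaEnvironment 通过 observe → act → step 循环驱动。下面用一个不依赖 Playwright 的同步桩环境演示该协议(`FakeEnv`、`run_episode` 及其中的桩策略均为示意,真实评测中由上文的异步环境与 LLM Agent 替代):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FakeObservation:
    url: str
    title: str

@dataclass
class FakeEnv:
    """桩环境:记录收到的动作序列,固定步数后结束"""
    max_steps: int = 3
    step_count: int = 0
    actions: List[str] = field(default_factory=list)

    def reset(self) -> FakeObservation:
        self.step_count = 0
        self.actions = []
        return FakeObservation(url="http://localhost:7770", title="Home")

    def step(self, action: str) -> FakeObservation:
        self.step_count += 1
        self.actions.append(action)
        return FakeObservation(
            url=f"http://localhost:7770/page{self.step_count}",
            title=f"Page {self.step_count}",
        )

    def is_done(self) -> bool:
        return self.step_count >= self.max_steps

def run_episode(env: FakeEnv) -> Tuple[List[str], str]:
    """observe → act → step 主循环;真实实现中动作由 LLM 依据观察生成"""
    obs = env.reset()
    while not env.is_done():
        action = f'click(elem_on_{obs.title.replace(" ", "_")})'  # 桩策略
        obs = env.step(action)
    return env.actions, obs.url

actions, final_url = run_episode(FakeEnv(max_steps=3))
```

循环结构与上文一致:每一步 Agent 只看到当前观察,因此历史信息必须显式写入 prompt 或由环境维护。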

class WebArenaEvaluator:
    """WebArena 评估器"""
    
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        
    async def evaluate_task(
        self,
        task: Dict[str, Any],
        env: WebArenaEnvironment,
        final_answer: Optional[str] = None
    ) -> Dict[str, Any]:
        """评估任务完成情况"""
        
        eval_type = task.get("eval_type", "state_check")
        
        if eval_type == "state_check":
            return await self._evaluate_state_check(task, env)
        elif eval_type == "string_match":
            return self._evaluate_string_match(task, final_answer)
        elif eval_type == "fuzzy_match":
            return self._evaluate_fuzzy_match(task, final_answer)
        elif eval_type == "url_check":
            return self._evaluate_url_check(task, env)
        elif eval_type == "content_check":
            return await self._evaluate_content_check(task, env)
        else:
            return {"success": False, "error": "Unknown eval type"}
    
    async def _evaluate_state_check(
        self,
        task: Dict[str, Any],
        env: WebArenaEnvironment
    ) -> Dict[str, Any]:
        """检查页面状态"""
        
        expected_state = task.get("expected_state", {})
        
        # 例如检查购物车
        if "cart_items" in expected_state:
            # 调用页面 API 或检查 DOM
            actual_cart = await self._get_cart_items(env)
            expected_cart = expected_state["cart_items"]
            
            # 比较购物车内容
            success = self._compare_cart(actual_cart, expected_cart)
            return {"success": success, "actual": actual_cart}
        
        return {"success": False}
    
    def _evaluate_string_match(
        self,
        task: Dict[str, Any],
        answer: Optional[str]
    ) -> Dict[str, Any]:
        """字符串匹配(期望答案需作为子串出现在回答中)"""
        
        answer = answer or ""
        expected = task.get("expected_answer", "")
        success = expected.lower().strip() in answer.lower().strip()
        
        return {"success": success, "expected": expected, "actual": answer}
    
    def _evaluate_fuzzy_match(
        self,
        task: Dict[str, Any],
        answer: Optional[str]
    ) -> Dict[str, Any]:
        """模糊匹配"""
        
        answer = answer or ""
        expected = task.get("expected_answer", "")
        
        # 使用相似度计算
        similarity = self._compute_similarity(expected, answer)
        success = similarity > 0.8
        
        return {
            "success": success,
            "similarity": similarity,
            "expected": expected,
            "actual": answer
        }
    
    def _evaluate_url_check(
        self,
        task: Dict[str, Any],
        env: WebArenaEnvironment
    ) -> Dict[str, Any]:
        """检查最终 URL"""
        
        expected_url_pattern = task.get("expected_url_pattern", "")
        actual_url = env.page.url
        
        import re
        success = bool(re.match(expected_url_pattern, actual_url))
        
        return {"success": success, "actual_url": actual_url}
    
    async def _evaluate_content_check(
        self,
        task: Dict[str, Any],
        env: WebArenaEnvironment
    ) -> Dict[str, Any]:
        """检查页面内容"""
        
        expected_content = task.get("expected_content", {})
        results = {}
        
        for key, expected_value in expected_content.items():
            actual_value = await self._extract_content(env, key)
            results[key] = {
                "expected": expected_value,
                "actual": actual_value,
                "match": self._content_matches(expected_value, actual_value)
            }
        
        success = all(r["match"] for r in results.values())
        return {"success": success, "details": results}
    
    async def _get_cart_items(self, env: WebArenaEnvironment) -> List:
        """获取购物车商品(占位实现,实际需查询页面 DOM 或后端 API)"""
        return []
    
    def _compare_cart(self, actual: List, expected: List) -> bool:
        """比较购物车(占位实现,实际需逐项比对类别/颜色/价格等约束)"""
        return True
    
    def _compute_similarity(self, s1: str, s2: str) -> float:
        """计算字符串相似度"""
        from difflib import SequenceMatcher
        return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
    
    async def _extract_content(self, env: WebArenaEnvironment, key: str):
        """提取页面内容(占位实现:实际应按 key 定位并抽取页面元素文本)"""
        return None
    
    def _content_matches(self, expected, actual) -> bool:
        """内容匹配"""
        if isinstance(expected, bool):
            return expected == actual
        if isinstance(expected, str):
            return expected.lower() in str(actual).lower()
        return expected == actual
3.3.4 WebArena 评估结果
┌─────────────────────────────────────────────────────────────────┐
│              WebArena 评估结果(2024年数据)                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  模型                    │  成功率  │  平均步数  │  效率          │
│  ────────────────────────────────────────────────────────────   │
│  GPT-4 + SoM             │  14.4%   │   18.2    │  0.42         │
│  GPT-4V (Vision)         │  12.8%   │   15.7    │  0.48         │
│  Claude-3 Opus           │  11.2%   │   16.4    │  0.45         │
│  GPT-3.5-turbo           │   4.5%   │   21.3    │  0.31         │
│  LLaMA-2-70B             │   2.1%   │   24.8    │  0.22         │
│                                                                 │
│  人类基准                 │  78.2%   │    8.4    │  0.89         │
│                                                                 │
│  关键发现:                                                      │
│  • 最强模型与人类差距仍超过 60%                                   │
│  • 视觉输入略有帮助但提升有限                                     │
│  • 长序列任务显著更难                                             │
│  • 需要推理的任务成功率更低                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

💡 思考:为什么 WebArena 的成功率如此之低?

🤔 解答:主要原因包括:

  1. 页面理解困难:真实网页结构复杂,难以准确理解
  2. 动作空间巨大:每个页面可能有数百个可交互元素
  3. 错误积累:一步错误可能导致整个任务失败
  4. 上下文局限:难以在长对话中保持任务状态
  5. 缺乏恢复能力:遇到意外情况难以自我修正

3.4 其他重要 Benchmark

除了上述三个主要 Benchmark,还有许多其他重要的 Agent 评估基准:

3.4.1 Benchmark 全景图
┌─────────────────────────────────────────────────────────────────┐
│                    Agent Benchmark 全景图                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │    综合     │  │   代码/OS   │  │    网页     │             │
│  │ ─────────── │  │ ─────────── │  │ ─────────── │             │
│  │ AgentBench  │  │ SWE-bench   │  │ WebArena    │             │
│  │ GAIA        │  │ InterCode   │  │ Mind2Web    │             │
│  │ τ-bench     │  │ AgentCoder  │  │ WebShop     │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │   工具调用   │  │    推理     │  │    对话     │             │
│  │ ─────────── │  │ ─────────── │  │ ─────────── │             │
│  │ ToolBench   │  │ ReasonBench │  │ MT-Bench    │             │
│  │ API-Bank    │  │ BIG-bench   │  │ AlpacaEval  │             │
│  │ ToolAlpaca  │  │ ARC         │  │ LMSYS Arena │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │   多模态    │  │   游戏/具身  │  │   安全     │             │
│  │ ─────────── │  │ ─────────── │  │ ─────────── │             │
│  │ VisualWebAr │  │ ALFWorld    │  │ SafetyBench │             │
│  │ OSWorld     │  │ MineDojo    │  │ TrustLLM    │             │
│  │ ScreenSpot  │  │ Voyager     │  │ RedTeaming  │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
3.4.2 SWE-bench:软件工程评测

SWE-bench 专注于评估 Agent 解决真实 GitHub Issue 的能力:

# SWE-bench 任务示例
swe_bench_task = {
    "instance_id": "django__django-11099",
    "repo": "django/django",
    "base_commit": "abc123...",
    "problem_statement": """
    UsernameValidator allows trailing newlines in usernames.
    
    Description:
    The ASCIIUsernameValidator and UnicodeUsernameValidator 
    use the regex r'^[\\w.@+-]+$' which should reject strings 
    with a trailing newline, but due to the $ anchor behavior, 
    it actually allows them.
    
    Steps to reproduce:
    >>> from django.contrib.auth.validators import ASCIIUsernameValidator
    >>> v = ASCIIUsernameValidator()
    >>> v("username\\n")  # Should raise ValidationError but doesn't
    """,
    "hints": [],
    "test_patch": "...",  # 用于验证修复的测试
    "expected_fix": {
        "file": "django/contrib/auth/validators.py",
        "type": "regex_fix"
    }
}
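SWE-bench 的评测方式是:将模型生成的补丁应用到仓库,再运行 test_patch 中的测试。只有当原本失败的测试全部转为通过、且原本通过的测试没有被破坏时,该实例才算"解决"(resolved)。下面是这一判定逻辑的极简示意(测试名为虚构,字段名参考 SWE-bench 数据集中的 FAIL_TO_PASS / PASS_TO_PASS 约定):

```python
from typing import Dict, List

def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str]
) -> bool:
    """判定一个 SWE-bench 实例是否被"解决"(简化示意)"""
    # FAIL_TO_PASS:修复后必须由失败变为通过
    fixes_ok = all(test_results.get(t, False) for t in fail_to_pass)
    # PASS_TO_PASS:不能引入回归
    no_regression = all(test_results.get(t, False) for t in pass_to_pass)
    return fixes_ok and no_regression

# 示例:补丁修复了目标测试,且未破坏既有测试
results = {
    "test_reject_trailing_newline": True,  # 原本失败,修复后通过
    "test_accept_valid_username": True,    # 原本就通过
}
print(is_resolved(
    results,
    fail_to_pass=["test_reject_trailing_newline"],
    pass_to_pass=["test_accept_valid_username"],
))  # True
```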
3.4.3 GAIA:通用 AI 助手评测

GAIA 评估 Agent 完成复杂现实任务的能力:

# GAIA 任务示例(不同难度级别)
gaia_tasks = {
    "level_1": {
        "question": "What is the capital of the country where the 2024 "
                    "Olympics will be held?",
        "answer": "Paris",
        "skills": ["web_search", "knowledge_retrieval"]
    },
    "level_2": {
        "question": "Download the CSV file from this link and compute "
                    "the average value of the 'price' column for items "
                    "in category 'electronics'.",
        "answer": "299.50",
        "skills": ["file_download", "data_processing", "computation"]
    },
    "level_3": {
        "question": "Using the company's public financial reports, "
                    "calculate the year-over-year revenue growth rate "
                    "for the past 3 years and create a trend visualization.",
        "answer": "[image or detailed report]",
        "skills": ["document_analysis", "calculation", "visualization"]
    }
}
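GAIA 的评分通常采用"准精确匹配"(quasi-exact match):先对答案做归一化(大小写、空白、数字格式等),再与标准答案精确比对。下面是一个简化示意(归一化规则为简化版,并非官方实现):

```python
import re

def normalize_answer(ans: str) -> str:
    """简化版答案归一化:小写、去首尾空白、去数字千分位逗号、压缩空格"""
    ans = ans.strip().lower()
    ans = re.sub(r"(?<=\d),(?=\d)", "", ans)  # "1,234" -> "1234"
    ans = re.sub(r"\s+", " ", ans)
    return ans

def quasi_exact_match(predicted: str, expected: str) -> bool:
    """归一化后做精确匹配"""
    return normalize_answer(predicted) == normalize_answer(expected)

print(quasi_exact_match("  Paris ", "paris"))  # True
print(quasi_exact_match("1,234.5", "1234.5"))  # True
print(quasi_exact_match("299.50", "299.5"))    # False(简化版不做数值等价判断)
```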
3.4.4 τ-bench:真实用户交互评测

τ-bench (Tau-bench) 模拟真实的客服/助手交互场景:

# τ-bench 对话示例
tau_bench_conversation = {
    "scenario": "airline_customer_service",
    "user_persona": {
        "name": "Alex",
        "context": "Has a flight booked for tomorrow, needs to change to "
                   "a different date due to work conflict",
        "hidden_constraints": [
            "Budget limited to $100 change fee",
            "Can only travel on weekends"
        ]
    },
    "agent_role": "Customer service representative with access to booking system",
    "success_criteria": [
        "Agent identifies the correct booking",
        "Agent finds suitable alternative flights",
        "Agent respects user's hidden constraints",
        "Agent completes the change successfully"
    ],
    "conversation_turns": [
        {"user": "Hi, I need to change my flight tomorrow"},
        {"agent": "I'd be happy to help. May I have your booking reference?"},
        # ... 继续对话
    ]
}
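评测这类对话任务时,success_criteria 通常由程序化检查器逐条验证。以上例中的两条隐藏约束为例,可以写一个简化的检查函数(函数名与参数均为示意):

```python
from datetime import date

def satisfies_hidden_constraints(
    new_flight_date: date,
    change_fee_usd: float,
    max_fee_usd: float = 100.0
) -> bool:
    """检查改签方案是否同时满足两条隐藏约束:
    改签费不超过预算,且新航班在周末。"""
    is_weekend = new_flight_date.weekday() >= 5  # 5=周六, 6=周日
    return change_fee_usd <= max_fee_usd and is_weekend

# 2025-01-11 是周六
print(satisfies_hidden_constraints(date(2025, 1, 11), 75.0))   # True
print(satisfies_hidden_constraints(date(2025, 1, 13), 75.0))   # False(周一)
print(satisfies_hidden_constraints(date(2025, 1, 11), 150.0))  # False(超预算)
```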

4. 评估指标体系详解 📈

4.1 任务完成度指标

任务完成度是最基本也是最重要的评估维度:

┌─────────────────────────────────────────────────────────────────┐
│                    任务完成度指标体系                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Success Rate (SR)                       │  │
│  │                    ─────────────────                       │  │
│  │  定义:成功完成任务的比例                                    │  │
│  │  公式:SR = 成功任务数 / 总任务数                           │  │
│  │  特点:最直观,但过于粗粒度                                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Partial Credit (PC)                        │  │
│  │                 ───────────────────                        │  │
│  │  定义:部分完成任务的得分                                    │  │
│  │  公式:PC = Σ(子任务完成度 × 权重)                          │  │
│  │  特点:更细粒度,认可部分进展                                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              Progress Rate (PR)                            │  │
│  │              ─────────────────                             │  │
│  │  定义:任务推进程度                                          │  │
│  │  公式:PR = 已完成步骤 / 所需总步骤                          │  │
│  │  特点:衡量任务推进的连续性                                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
from typing import Dict, List, Any, Callable
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):
    """指标类型"""
    BINARY = "binary"           # 二元(成功/失败)
    CONTINUOUS = "continuous"   # 连续(0-1之间)
    CATEGORICAL = "categorical" # 分类(多个等级)

@dataclass
class EvaluationMetric:
    """评估指标定义"""
    name: str
    metric_type: MetricType
    compute_fn: Callable
    weight: float = 1.0
    description: str = ""

class TaskCompletionMetrics:
    """任务完成度指标计算"""
    
    @staticmethod
    def success_rate(results: List[Dict[str, Any]]) -> float:
        """
        计算成功率
        
        Args:
            results: 评估结果列表,每个包含 'success' 字段
        
        Returns:
            成功率 (0-1)
        """
        if not results:
            return 0.0
        
        successes = sum(1 for r in results if r.get('success', False))
        return successes / len(results)
    
    @staticmethod
    def partial_credit(
        subtasks: List[Dict[str, Any]],
        weights: List[float] = None
    ) -> float:
        """
        计算部分完成分数
        
        Args:
            subtasks: 子任务完成情况列表
            weights: 各子任务权重(默认相等)
        
        Returns:
            加权完成分数 (0-1)
        """
        if not subtasks:
            return 0.0
        
        if weights is None:
            weights = [1.0] * len(subtasks)
        
        total_weight = sum(weights)
        weighted_score = sum(
            s.get('score', 0) * w 
            for s, w in zip(subtasks, weights)
        )
        
        return weighted_score / total_weight
    
    @staticmethod
    def progress_rate(
        completed_steps: int,
        total_steps: int,
        critical_steps: List[int] = None
    ) -> float:
        """
        计算进度率
        
        Args:
            completed_steps: 已完成步骤数
            total_steps: 总步骤数
            critical_steps: 关键步骤索引(完成这些步骤额外加分)
        
        Returns:
            进度分数 (0-1)
        """
        if total_steps == 0:
            return 0.0
        
        base_progress = completed_steps / total_steps
        
        if critical_steps:
            critical_bonus = sum(
                1 for step in critical_steps 
                if step < completed_steps
            ) / len(critical_steps)
            return 0.7 * base_progress + 0.3 * critical_bonus
        
        return base_progress
    
    @staticmethod
    def goal_condition_recall(
        achieved_goals: List[str],
        required_goals: List[str]
    ) -> float:
        """
        计算目标条件召回率
        
        Args:
            achieved_goals: 达成的目标列表
            required_goals: 需要达成的目标列表
        
        Returns:
            召回率 (0-1)
        """
        if not required_goals:
            return 1.0
        
        achieved_set = set(achieved_goals)
        required_set = set(required_goals)
        
        return len(achieved_set & required_set) / len(required_set)
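
把玩具数据代入上面的公式,可以快速验证各指标的计算方式:

```python
# 成功率 SR:4 个任务成功 3 个
results = [{"success": True}, {"success": False},
           {"success": True}, {"success": True}]
sr = sum(1 for r in results if r["success"]) / len(results)
print(sr)  # 0.75

# 部分得分 PC:三个子任务,第一个权重加倍
subtasks = [{"score": 1.0}, {"score": 0.5}, {"score": 0.0}]
weights = [2.0, 1.0, 1.0]
pc = sum(s["score"] * w for s, w in zip(subtasks, weights)) / sum(weights)
print(pc)  # 0.625

# 目标条件召回率 GCR:2 个必需目标中达成了 1 个
achieved, required = {"login", "search"}, {"search", "checkout"}
gcr = len(achieved & required) / len(required)
print(gcr)  # 0.5
```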

4.2 过程质量指标

仅看结果不够,我们还需要评估 Agent 的推理过程:

class ProcessQualityMetrics:
    """过程质量指标计算"""
    
    @staticmethod
    def action_validity_rate(
        actions: List[Dict[str, Any]],
        valid_actions: List[str]
    ) -> float:
        """
        计算有效动作比例
        
        Args:
            actions: Agent 执行的动作列表
            valid_actions: 允许的有效动作类型
        
        Returns:
            有效动作比例 (0-1)
        """
        if not actions:
            return 0.0
        
        valid_count = sum(
            1 for a in actions 
            if a.get('type') in valid_actions
        )
        return valid_count / len(actions)
    
    @staticmethod
    def reasoning_coherence(
        thought_chain: List[str],
        task_description: str
    ) -> float:
        """
        评估推理连贯性
        
        通过检查思维链中的逻辑关联性来评分
        
        Args:
            thought_chain: Agent 的思维链
            task_description: 任务描述
        
        Returns:
            连贯性分数 (0-1)
        """
        if not thought_chain:
            return 0.0
        
        # 评估指标:
        # 1. 思维链是否与任务相关
        # 2. 前后步骤是否逻辑连贯
        # 3. 是否有重复或循环
        
        scores = []
        
        # 检查任务相关性
        task_keywords = set(task_description.lower().split())
        for thought in thought_chain:
            thought_words = set(thought.lower().split())
            relevance = len(task_keywords & thought_words) / max(len(task_keywords), 1)
            scores.append(min(1.0, relevance * 2))  # 放宽标准
        
        # 检查重复
        unique_thoughts = len(set(thought_chain))
        repetition_penalty = unique_thoughts / len(thought_chain)
        
        avg_relevance = sum(scores) / len(scores) if scores else 0
        return 0.7 * avg_relevance + 0.3 * repetition_penalty
    
    @staticmethod
    def error_recovery_rate(
        trajectory: List[Dict[str, Any]]
    ) -> float:
        """
        计算错误恢复率
        
        评估 Agent 从错误中恢复的能力
        
        Args:
            trajectory: 完整的执行轨迹
        
        Returns:
            恢复率 (0-1)
        """
        errors_encountered = 0
        errors_recovered = 0
        
        for i, step in enumerate(trajectory):
            if step.get('error'):
                errors_encountered += 1
                
                # 检查后续步骤是否成功恢复
                for j in range(i + 1, min(i + 4, len(trajectory))):
                    if trajectory[j].get('success'):
                        errors_recovered += 1
                        break
        
        if errors_encountered == 0:
            return 1.0  # 没有错误,完美
        
        return errors_recovered / errors_encountered
    
    @staticmethod
    def planning_quality(
        initial_plan: List[str],
        actual_execution: List[str],
        success: bool
    ) -> Dict[str, float]:
        """
        评估规划质量
        
        Args:
            initial_plan: 初始计划步骤
            actual_execution: 实际执行步骤
            success: 任务是否成功
        
        Returns:
            规划质量指标字典
        """
        # 计划执行一致性(计划与实际执行的 Jaccard 相似度)
        plan_set = set(initial_plan)
        exec_set = set(actual_execution)
        
        consistency = len(plan_set & exec_set) / max(len(plan_set | exec_set), 1)
        
        # 计划完整性(计划覆盖了多少实际需要的步骤)
        completeness = len(plan_set & exec_set) / max(len(exec_set), 1)
        
        # 计划效率(计划步骤中有多少被实际执行,衡量冗余程度)
        efficiency = len(plan_set & exec_set) / max(len(plan_set), 1)
        
        # 成功相关性
        success_correlation = 1.0 if success else 0.0
        
        return {
            "consistency": consistency,
            "completeness": completeness,
            "efficiency": efficiency,
            "success_correlation": success_correlation,
            "overall": 0.3 * consistency + 0.3 * completeness + 
                      0.2 * efficiency + 0.2 * success_correlation
        }
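
以 error_recovery_rate 为例,用一条玩具轨迹独立复现其判定逻辑:错误发生后的 3 步内出现成功动作,即视为恢复:

```python
trajectory = [
    {"action": "click", "success": True},
    {"action": "type", "error": "element not found"},
    {"action": "scroll"},
    {"action": "type", "success": True},       # 在 3 步内恢复
    {"action": "submit", "error": "timeout"},  # 之后没有步骤,未恢复
]

errors = recovered = 0
for i, step in enumerate(trajectory):
    if step.get("error"):
        errors += 1
        # 检查随后 3 步内是否出现成功动作
        if any(trajectory[j].get("success")
               for j in range(i + 1, min(i + 4, len(trajectory)))):
            recovered += 1

print(errors, recovered)   # 2 1
print(recovered / errors)  # 0.5
```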

4.3 效率与成本指标

在实际应用中,效率和成本同样重要:

from dataclasses import dataclass
from typing import Dict, List, Any
import time

@dataclass
class CostBreakdown:
    """成本分解"""
    llm_tokens: int
    llm_cost_usd: float
    api_calls: int
    api_cost_usd: float
    time_seconds: float
    total_cost_usd: float

class EfficiencyMetrics:
    """效率与成本指标"""
    
    # 价格配置(示例)
    PRICING = {
        "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
        "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
        "claude-3-opus": {"input": 0.015, "output": 0.075}
    }
    
    @staticmethod
    def compute_step_efficiency(
        actual_steps: int,
        optimal_steps: int,
        penalty_factor: float = 0.1
    ) -> float:
        """
        计算步数效率
        
        Args:
            actual_steps: 实际使用步数
            optimal_steps: 最优步数
            penalty_factor: 超出步数的惩罚系数
        
        Returns:
            效率分数 (0-1)
        """
        if actual_steps == 0:
            return 0.0
        if actual_steps <= optimal_steps:
            return 1.0
        
        extra_steps = actual_steps - optimal_steps
        penalty = penalty_factor * extra_steps
        return max(0.0, 1.0 - penalty)
    
    @staticmethod
    def compute_token_efficiency(
        tokens_used: int,
        task_complexity: str = "medium"
    ) -> float:
        """
        计算 token 效率
        
        Args:
            tokens_used: 使用的 token 数
            task_complexity: 任务复杂度 (easy/medium/hard)
        
        Returns:
            效率分数 (0-1)
        """
        # 基准 token 数(根据复杂度)
        baselines = {
            "easy": 2000,
            "medium": 5000,
            "hard": 15000
        }
        
        baseline = baselines.get(task_complexity, 5000)
        
        if tokens_used <= baseline:
            return 1.0
        
        # 线性衰减惩罚:恰好为 baseline 时得 1.0,达到 2 倍 baseline 时降为 0
        ratio = tokens_used / baseline
        return max(0.0, 2.0 - ratio)
    
    @staticmethod
    def compute_cost(
        model: str,
        input_tokens: int,
        output_tokens: int,
        api_calls: int = 0,
        api_cost_per_call: float = 0.0
    ) -> CostBreakdown:
        """
        计算完整成本
        
        Args:
            model: 使用的模型
            input_tokens: 输入 token 数
            output_tokens: 输出 token 数
            api_calls: API 调用次数
            api_cost_per_call: 每次 API 调用成本
        
        Returns:
            成本分解
        """
        pricing = EfficiencyMetrics.PRICING.get(
            model, 
            {"input": 0.01, "output": 0.03}
        )
        
        llm_cost = (
            input_tokens / 1000 * pricing["input"] +
            output_tokens / 1000 * pricing["output"]
        )
        
        api_cost = api_calls * api_cost_per_call
        
        return CostBreakdown(
            llm_tokens=input_tokens + output_tokens,
            llm_cost_usd=llm_cost,
            api_calls=api_calls,
            api_cost_usd=api_cost,
            time_seconds=0,  # 需要实际测量
            total_cost_usd=llm_cost + api_cost
        )
    
    @staticmethod
    def compute_time_efficiency(
        actual_time: float,
        time_limit: float,
        success: bool
    ) -> float:
        """
        计算时间效率
        
        Args:
            actual_time: 实际耗时(秒)
            time_limit: 时间限制(秒)
            success: 任务是否成功
        
        Returns:
            时间效率分数 (0-1)
        """
        if not success:
            return 0.0
        
        if actual_time <= time_limit * 0.5:
            return 1.0
        elif actual_time <= time_limit:
            # 线性衰减
            return 1.0 - 0.5 * (actual_time - time_limit * 0.5) / (time_limit * 0.5)
        else:
            return 0.0
    
    @staticmethod
    def compute_pareto_efficiency(
        results: List[Dict[str, Any]],
        metrics: List[str] = ["success_rate", "cost", "time"]
    ) -> List[Dict[str, Any]]:
        """
        计算 Pareto 最优解
        
        找出在多个指标上没有被其他解支配的结果
        
        Args:
            results: 结果列表
            metrics: 评估指标(cost 和 time 越低越好,其他越高越好)
        
        Returns:
            Pareto 最优解列表
        """
        def dominates(a: Dict, b: Dict) -> bool:
            """检查 a 是否支配 b:a 在所有指标上不劣于 b,且至少一项严格更优"""
            strictly_better = False
            for m in metrics:
                if m in ["cost", "time"]:  # 越低越好
                    if a[m] < b[m]:
                        strictly_better = True
                    elif a[m] > b[m]:
                        return False
                else:  # 越高越好
                    if a[m] > b[m]:
                        strictly_better = True
                    elif a[m] < b[m]:
                        return False
            return strictly_better
        
        pareto_front = []
        for r in results:
            is_dominated = False
            for other in results:
                if r != other and dominates(other, r):
                    is_dominated = True
                    break
            if not is_dominated:
                pareto_front.append(r)
        
        return pareto_front
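
以上文的示例价格手工核算一次成本,可以直观感受各部分的量级(价格仅为示例,并非实时报价):

```python
# 一次任务消耗:10K 输入 token、2K 输出 token、5 次外部 API 调用
input_tokens, output_tokens = 10_000, 2_000
price = {"input": 0.03, "output": 0.06}  # GPT-4 示例价,美元 / 1K tokens

llm_cost = (input_tokens / 1000 * price["input"]
            + output_tokens / 1000 * price["output"])
api_cost = 5 * 0.01  # 每次外部调用按 $0.01 计
total = llm_cost + api_cost

print(round(llm_cost, 2))  # 0.42
print(round(total, 2))     # 0.47
```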

4.4 安全与可靠性指标

随着 Agent 能力增强,安全性评估越来越重要:

from typing import Dict, List, Any, Set
from dataclasses import dataclass
from enum import Enum

class SafetyCategory(Enum):
    """安全类别"""
    HARMFUL_CONTENT = "harmful_content"
    PRIVACY_LEAK = "privacy_leak"
    UNAUTHORIZED_ACCESS = "unauthorized_access"
    SYSTEM_DAMAGE = "system_damage"
    DATA_LOSS = "data_loss"
    DECEPTION = "deception"

@dataclass
class SafetyViolation:
    """安全违规记录"""
    category: SafetyCategory
    severity: str  # low, medium, high, critical
    description: str
    action_context: str

class SafetyMetrics:
    """安全性指标计算"""
    
    # 敏感操作列表
    SENSITIVE_OPERATIONS = {
        "file_delete": "high",
        "system_config": "high",
        "network_request": "medium",
        "database_write": "medium",
        "file_write": "low",
        "api_call": "low"
    }
    
    # 敏感数据模式
    SENSITIVE_PATTERNS = [
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",  # Email
        r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",  # Phone
        r"\b\d{3}[-]?\d{2}[-]?\d{4}\b",  # SSN
        r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b",  # Credit Card
    ]
    
    @staticmethod
    def compute_safety_score(
        trajectory: List[Dict[str, Any]],
        violations: List[SafetyViolation]
    ) -> Dict[str, float]:
        """
        计算安全性分数
        
        Args:
            trajectory: 执行轨迹
            violations: 检测到的违规列表
        
        Returns:
            安全性指标字典
        """
        # 严重性权重
        severity_weights = {
            "low": 0.1,
            "medium": 0.3,
            "high": 0.6,
            "critical": 1.0
        }
        
        # 计算违规惩罚
        total_penalty = sum(
            severity_weights.get(v.severity, 0.5)
            for v in violations
        )
        
        # 基础安全分(从 1.0 开始扣分)
        base_score = max(0.0, 1.0 - total_penalty * 0.2)
        
        # 按类别统计
        category_counts = {}
        for v in violations:
            cat = v.category.value
            category_counts[cat] = category_counts.get(cat, 0) + 1
        
        return {
            "overall_safety_score": base_score,
            "violation_count": len(violations),
            "violations_by_category": category_counts,
            "has_critical_violation": any(
                v.severity == "critical" for v in violations
            )
        }
    
    @staticmethod
    def detect_privacy_leaks(
        outputs: List[str],
        sensitive_data: Set[str]
    ) -> List[SafetyViolation]:
        """
        检测隐私泄露
        
        Args:
            outputs: Agent 输出列表
            sensitive_data: 敏感数据集合
        
        Returns:
            违规列表
        """
        import re
        
        violations = []
        
        for output in outputs:
            # 检查已知敏感数据
            for data in sensitive_data:
                if data in output:
                    violations.append(SafetyViolation(
                        category=SafetyCategory.PRIVACY_LEAK,
                        severity="high",
                        description=f"Sensitive data leaked: {data[:10]}...",
                        action_context=output[:100]
                    ))
            
            # 检查敏感模式
            for pattern in SafetyMetrics.SENSITIVE_PATTERNS:
                if re.search(pattern, output):
                    violations.append(SafetyViolation(
                        category=SafetyCategory.PRIVACY_LEAK,
                        severity="medium",
                        description="Potential PII detected in output",
                        action_context=output[:100]
                    ))
        
        return violations
    
    @staticmethod
    def compute_reliability_metrics(
        results: List[Dict[str, Any]],
        num_runs_per_task: int = 3
    ) -> Dict[str, float]:
        """
        计算可靠性指标
        
        Args:
            results: 多次运行的结果
            num_runs_per_task: 每个任务运行次数
        
        Returns:
            可靠性指标
        """
        # 按任务分组
        tasks = {}
        for r in results:
            task_id = r.get("task_id")
            if task_id not in tasks:
                tasks[task_id] = []
            tasks[task_id].append(r)
        
        # 计算一致性(同一任务多次运行结果是否一致)
        consistencies = []
        for task_id, runs in tasks.items():
            if len(runs) >= 2:
                successes = [r.get("success", False) for r in runs]
                consistency = 1.0 if len(set(successes)) == 1 else 0.0
                consistencies.append(consistency)
        
        consistency_rate = sum(consistencies) / len(consistencies) \
                          if consistencies else 0.0
        
        # 计算稳定性(输出的变异程度)
        # 使用成功率的标准差
        import statistics
        
        task_success_rates = []
        for task_id, runs in tasks.items():
            success_rate = sum(
                1 for r in runs if r.get("success", False)
            ) / len(runs)
            task_success_rates.append(success_rate)
        
        stability = 1.0 - statistics.stdev(task_success_rates) \
                   if len(task_success_rates) > 1 else 1.0
        
        return {
            "consistency_rate": consistency_rate,
            "stability_score": stability,
            # 稳健成功率:每次运行都成功的任务占比
            "robust_success_rate": sum(
                1 for runs in tasks.values()
                if all(r.get("success", False) for r in runs)
            ) / len(tasks) if tasks else 0.0
        }
    
    @staticmethod
    def compute_boundary_adherence(
        actions: List[Dict[str, Any]],
        allowed_actions: Set[str],
        forbidden_actions: Set[str]
    ) -> Dict[str, float]:
        """
        计算边界遵守指标
        
        评估 Agent 是否遵守规定的操作边界
        
        Args:
            actions: 执行的动作列表
            allowed_actions: 允许的动作集合
            forbidden_actions: 禁止的动作集合
        
        Returns:
            边界遵守指标
        """
        total_actions = len(actions)
        if total_actions == 0:
            return {
                "adherence_rate": 1.0,
                "forbidden_violations": 0,
                "out_of_scope_actions": 0
            }
        
        forbidden_count = sum(
            1 for a in actions 
            if a.get("type") in forbidden_actions
        )
        
        out_of_scope = sum(
            1 for a in actions
            if a.get("type") not in allowed_actions
        )
        
        adherence_rate = (
            total_actions - forbidden_count - out_of_scope
        ) / total_actions
        
        return {
            "adherence_rate": adherence_rate,
            "forbidden_violations": forbidden_count,
            "out_of_scope_actions": out_of_scope
        }
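
把 compute_safety_score 的计分规则代入一个小例子:一次 high(权重 0.6)加一次 low(权重 0.1)违规,总惩罚 0.7,基础安全分为 1.0 - 0.7 × 0.2 = 0.86:

```python
# 手工核算 compute_safety_score 的扣分逻辑
severity_weights = {"low": 0.1, "medium": 0.3, "high": 0.6, "critical": 1.0}
violations = ["high", "low"]  # 一次高危 + 一次低危违规

total_penalty = sum(severity_weights[s] for s in violations)
base_score = max(0.0, 1.0 - total_penalty * 0.2)

print(round(total_penalty, 1))  # 0.7
print(round(base_score, 2))     # 0.86
```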

5. Benchmark 设计方法论 🔬

5.1 设计原则与框架

设计高质量的 Agent Benchmark 需要遵循以下原则:

┌─────────────────────────────────────────────────────────────────┐
│                 Benchmark 设计核心原则                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 1. 真实性 (Authenticity)                                │   │
│   │    • 任务来源于真实场景                                   │   │
│   │    • 环境模拟真实系统                                     │   │
│   │    • 评估标准反映实际需求                                 │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 2. 可复现性 (Reproducibility)                           │   │
│   │    • 环境状态可重置                                       │   │
│   │    • 随机因素可控                                         │   │
│   │    • 评估流程标准化                                       │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 3. 区分度 (Discriminability)                            │   │
│   │    • 难度梯度合理                                         │   │
│   │    • 能区分不同能力水平                                   │   │
│   │    • 避免天花板/地板效应                                  │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 4. 防污染性 (Contamination Resistance)                  │   │
│   │    • 动态生成任务                                         │   │
│   │    • 私有测试集                                           │   │
│   │    • 定期更新                                             │   │
│   └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │ 5. 全面性 (Comprehensiveness)                           │   │
│   │    • 覆盖多种能力维度                                     │   │
│   │    • 包含不同复杂度                                       │   │
│   │    • 考虑边界情况                                         │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

5.2 任务构建方法

from typing import Dict, List, Any, Tuple, Optional
from dataclasses import dataclass, field
from enum import Enum
import random

class TaskDifficulty(Enum):
    """任务难度级别"""
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"
    EXPERT = "expert"

class CapabilityDimension(Enum):
    """能力维度"""
    REASONING = "reasoning"
    TOOL_USE = "tool_use"
    PLANNING = "planning"
    MEMORY = "memory"
    ADAPTATION = "adaptation"
    COLLABORATION = "collaboration"

@dataclass
class TaskTemplate:
    """任务模板"""
    template_id: str
    pattern: str  # 带占位符的模板
    parameters: List[Dict[str, Any]]  # 参数规范
    difficulty: TaskDifficulty
    capabilities: List[CapabilityDimension]
    evaluation_type: str
    
@dataclass
class GeneratedTask:
    """生成的任务"""
    task_id: str
    instruction: str
    parameters: Dict[str, Any]
    difficulty: TaskDifficulty
    capabilities: List[CapabilityDimension]
    ground_truth: Any
    evaluation_config: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)

class TaskGenerator:
    """任务生成器"""
    
    def __init__(self, templates: List[TaskTemplate], seed: int = 42):
        self.templates = templates
        self.random = random.Random(seed)
        self.generated_tasks: List[GeneratedTask] = []
        
    def generate_task(
        self,
        template: TaskTemplate,
        param_values: Optional[Dict[str, Any]] = None
    ) -> GeneratedTask:
        """
        从模板生成具体任务
        
        Args:
            template: 任务模板
            param_values: 参数值(可选,未提供则随机生成)
        
        Returns:
            生成的任务
        """
        # 生成参数值
        if param_values is None:
            param_values = self._sample_parameters(template.parameters)
        
        # 填充模板
        instruction = template.pattern.format(**param_values)
        
        # 计算 ground truth
        ground_truth = self._compute_ground_truth(
            template, param_values
        )
        
        task = GeneratedTask(
            task_id=f"{template.template_id}_{len(self.generated_tasks)}",
            instruction=instruction,
            parameters=param_values,
            difficulty=template.difficulty,
            capabilities=template.capabilities,
            ground_truth=ground_truth,
            evaluation_config={
                "type": template.evaluation_type,
                "tolerance": self._get_tolerance(template.difficulty)
            }
        )
        
        self.generated_tasks.append(task)
        return task
    
    def _sample_parameters(
        self, 
        param_specs: List[Dict[str, Any]]
    ) -> Dict[str, Any]:
        """采样参数值"""
        values = {}
        for spec in param_specs:
            name = spec["name"]
            param_type = spec["type"]
            
            if param_type == "choice":
                values[name] = self.random.choice(spec["options"])
            elif param_type == "range":
                values[name] = self.random.randint(
                    spec["min"], spec["max"]
                )
            elif param_type == "float_range":
                values[name] = self.random.uniform(
                    spec["min"], spec["max"]
                )
            elif param_type == "string":
                values[name] = self.random.choice(
                    spec.get("examples", ["default"])
                )
        
        return values
    
    def _compute_ground_truth(
        self,
        template: TaskTemplate,
        params: Dict[str, Any]
    ) -> Any:
        """计算标准答案"""
        # 根据任务类型计算
        # 实际实现会更复杂
        return None
    
    def _get_tolerance(self, difficulty: TaskDifficulty) -> float:
        """获取评估容差"""
        tolerances = {
            TaskDifficulty.EASY: 0.1,
            TaskDifficulty.MEDIUM: 0.05,
            TaskDifficulty.HARD: 0.02,
            TaskDifficulty.EXPERT: 0.01
        }
        return tolerances.get(difficulty, 0.05)
    
    def generate_balanced_dataset(
        self,
        tasks_per_difficulty: int = 25,
        required_capabilities: Optional[List[CapabilityDimension]] = None
    ) -> List[GeneratedTask]:
        """
        生成平衡的数据集
        
        Args:
            tasks_per_difficulty: 每个难度级别的任务数
            required_capabilities: 必须覆盖的能力维度
        
        Returns:
            任务列表
        """
        dataset = []
        
        # 按难度分组模板
        templates_by_difficulty = {}
        for template in self.templates:
            diff = template.difficulty
            if diff not in templates_by_difficulty:
                templates_by_difficulty[diff] = []
            templates_by_difficulty[diff].append(template)
        
        # 为每个难度级别生成任务
        for difficulty in TaskDifficulty:
            templates = templates_by_difficulty.get(difficulty, [])
            if not templates:
                continue
            
            for _ in range(tasks_per_difficulty):
                template = self.random.choice(templates)
                task = self.generate_task(template)
                dataset.append(task)
        
        # 确保覆盖所有必需的能力维度
        if required_capabilities:
            covered = set()
            for task in dataset:
                covered.update(task.capabilities)
            
            missing = set(required_capabilities) - covered
            for cap in missing:
                # 找到包含该能力的模板并生成任务
                for template in self.templates:
                    if cap in template.capabilities:
                        task = self.generate_task(template)
                        dataset.append(task)
                        break
        
        return dataset

# 任务模板示例
example_templates = [
    TaskTemplate(
        template_id="web_search_simple",
        pattern="Search the web for information about {topic} and "
                "provide a summary of the top {num_results} results.",
        parameters=[
            {"name": "topic", "type": "choice", 
             "options": ["AI advances", "climate change", "space exploration"]},
            {"name": "num_results", "type": "range", "min": 1, "max": 5}
        ],
        difficulty=TaskDifficulty.EASY,
        capabilities=[CapabilityDimension.TOOL_USE],
        evaluation_type="semantic_similarity"
    ),
    TaskTemplate(
        template_id="multi_step_analysis",
        pattern="Download the dataset from {url}, analyze the {column} column, "
                "compute the {statistic}, and create a visualization.",
        parameters=[
            {"name": "url", "type": "string", 
             "examples": ["https://data.gov/sample.csv"]},
            {"name": "column", "type": "choice", 
             "options": ["price", "quantity", "date"]},
            {"name": "statistic", "type": "choice",
             "options": ["mean", "median", "std"]}
        ],
        difficulty=TaskDifficulty.HARD,
        capabilities=[
            CapabilityDimension.TOOL_USE, 
            CapabilityDimension.PLANNING,
            CapabilityDimension.REASONING
        ],
        evaluation_type="output_verification"
    )
]
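
上面的模板可以用 Python 标准库直接演示"从模板到任务实例"的过程。下面是一个独立可运行的简化示意(不依赖前文类定义,参数规范沿用 example_templates 的格式,仅为说明采样与填充逻辑):

```python
import random

# 简化版参数采样:与上文 TaskGenerator._sample_parameters 思路一致
def sample_params(specs, rng):
    values = {}
    for spec in specs:
        if spec["type"] == "choice":
            values[spec["name"]] = rng.choice(spec["options"])
        elif spec["type"] == "range":
            values[spec["name"]] = rng.randint(spec["min"], spec["max"])
    return values

pattern = ("Search the web for information about {topic} and "
           "provide a summary of the top {num_results} results.")
param_specs = [
    {"name": "topic", "type": "choice",
     "options": ["AI advances", "climate change", "space exploration"]},
    {"name": "num_results", "type": "range", "min": 1, "max": 5},
]

params = sample_params(param_specs, random.Random(42))
instruction = pattern.format(**params)
print(instruction)

# 固定种子保证可复现:同一 seed 生成同一任务实例
assert sample_params(param_specs, random.Random(42)) == params
```

同一模板配合不同种子即可产生大量变体,这也是 5.4 节用动态任务生成对抗数据污染的基础。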

5.3 评估流程设计

评估流程需要把任务、环境、Agent 与评估器串联起来,并支持并行执行与多次运行取平均。参考实现如下:

from typing import Dict, List, Any, Callable
from dataclasses import dataclass
from abc import ABC, abstractmethod
import asyncio
import json
import time

@dataclass
class EvaluationConfig:
    """评估配置"""
    max_steps: int = 30
    timeout_seconds: int = 300
    num_runs: int = 3
    save_trajectories: bool = True
    parallel_tasks: int = 4

class EvaluationPipeline:
    """评估流水线"""
    
    def __init__(self, config: EvaluationConfig):
        self.config = config
        self.results: List[Dict[str, Any]] = []
        
    async def run_evaluation(
        self,
        agent,
        environment,
        tasks: List[GeneratedTask],
        evaluator
    ) -> Dict[str, Any]:
        """
        运行完整评估流程
        
        Args:
            agent: 待评估的 Agent
            environment: 评估环境
            tasks: 任务列表
            evaluator: 评估器
        
        Returns:
            评估结果
        """
        start_time = time.time()
        
        # 并行执行任务
        semaphore = asyncio.Semaphore(self.config.parallel_tasks)
        
        async def run_single_task(task: GeneratedTask):
            async with semaphore:
                return await self._evaluate_task(
                    agent, environment, task, evaluator
                )
        
        # 运行所有任务
        task_results = await asyncio.gather(*[
            run_single_task(task) for task in tasks
        ])
        
        # 汇总结果
        summary = self._compute_summary(task_results)
        
        return {
            "summary": summary,
            "task_results": task_results,
            "config": self.config.__dict__,
            "total_time": time.time() - start_time
        }
    
    async def _evaluate_task(
        self,
        agent,
        environment,
        task: GeneratedTask,
        evaluator
    ) -> Dict[str, Any]:
        """评估单个任务"""
        
        run_results = []
        
        for run_idx in range(self.config.num_runs):
            # 重置环境
            observation = await environment.reset(task.__dict__)
            
            trajectory = []
            step = 0
            start_time = time.time()
            
            # 交互循环
            while step < self.config.max_steps:
                # 检查超时
                if time.time() - start_time > self.config.timeout_seconds:
                    break
                
                # Agent 决策
                action = await agent.act(task.__dict__, observation)
                
                # 记录轨迹
                trajectory.append({
                    "step": step,
                    "observation": str(observation)[:500],  # 截断,避免轨迹过大
                    "action": action.__dict__ if hasattr(action, '__dict__') else str(action)
                })
                step += 1  # 计入已执行的动作数(包括终止动作)
                
                # 检查是否完成
                if self._is_terminal_action(action):
                    break
                
                # 执行动作
                observation = await environment.step(action)
            
            # 评估结果
            eval_result = await evaluator.evaluate_task(
                task.__dict__,
                environment,
                self._extract_final_answer(trajectory)
            )
            
            run_results.append({
                "run_idx": run_idx,
                "steps": step,
                "time": time.time() - start_time,
                "trajectory": trajectory if self.config.save_trajectories else None,
                "evaluation": eval_result
            })
        
        # 聚合多次运行结果
        return {
            "task_id": task.task_id,
            "difficulty": task.difficulty.value,
            "capabilities": [c.value for c in task.capabilities],
            "runs": run_results,
            "aggregate": self._aggregate_runs(run_results)
        }
    
    def _is_terminal_action(self, action) -> bool:
        """检查是否为终止动作"""
        if hasattr(action, 'action_type'):
            return action.action_type.value in ['stop', 'finish', 'done']
        return False
    
    def _extract_final_answer(self, trajectory: List[Dict]) -> str:
        """从轨迹中提取最终答案"""
        if not trajectory:
            return ""
        last_action = trajectory[-1].get("action", {})
        if isinstance(last_action, dict):
            return last_action.get("text", last_action.get("answer", ""))
        return str(last_action)
    
    def _aggregate_runs(
        self, 
        runs: List[Dict[str, Any]]
    ) -> Dict[str, float]:
        """聚合多次运行结果"""
        
        successes = [
            r["evaluation"].get("success", False) for r in runs
        ]
        steps = [r["steps"] for r in runs]
        times = [r["time"] for r in runs]
        
        return {
            "success_rate": sum(successes) / len(successes),
            "all_successful": all(successes),
            "any_successful": any(successes),
            "avg_steps": sum(steps) / len(steps),
            "avg_time": sum(times) / len(times),
            "consistency": 1.0 if len(set(successes)) == 1 else 0.0
        }
    
    def _compute_summary(
        self, 
        results: List[Dict[str, Any]]
    ) -> Dict[str, Any]:
        """计算总体摘要"""
        
        # 总体指标
        success_rates = [r["aggregate"]["success_rate"] for r in results]
        overall_sr = sum(success_rates) / len(success_rates)
        
        # 按难度分组
        by_difficulty = {}
        for r in results:
            diff = r["difficulty"]
            if diff not in by_difficulty:
                by_difficulty[diff] = []
            by_difficulty[diff].append(r["aggregate"]["success_rate"])
        
        difficulty_metrics = {
            diff: sum(rates) / len(rates)
            for diff, rates in by_difficulty.items()
        }
        
        # 按能力分组
        by_capability = {}
        for r in results:
            for cap in r["capabilities"]:
                if cap not in by_capability:
                    by_capability[cap] = []
                by_capability[cap].append(r["aggregate"]["success_rate"])
        
        capability_metrics = {
            cap: sum(rates) / len(rates)
            for cap, rates in by_capability.items()
        }
        
        return {
            "overall_success_rate": overall_sr,
            "total_tasks": len(results),
            "by_difficulty": difficulty_metrics,
            "by_capability": capability_metrics,
            "consistency_rate": sum(
                r["aggregate"]["consistency"] for r in results
            ) / len(results)
        }
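
为直观理解上面 _aggregate_runs 的统计口径,下面用一组假设的三次运行结果独立演示(数据为虚构示例):

```python
# 三次运行的(虚构)结果:两次成功、一次失败
runs = [
    {"evaluation": {"success": True},  "steps": 8,  "time": 12.0},
    {"evaluation": {"success": True},  "steps": 10, "time": 15.0},
    {"evaluation": {"success": False}, "steps": 30, "time": 60.0},
]

successes = [r["evaluation"].get("success", False) for r in runs]
aggregate = {
    "success_rate": sum(successes) / len(successes),          # 2/3
    "all_successful": all(successes),                         # False
    "any_successful": any(successes),                         # True
    "consistency": 1.0 if len(set(successes)) == 1 else 0.0,  # 结果不一致 -> 0.0
    "avg_steps": sum(r["steps"] for r in runs) / len(runs),   # 16.0
}
print(aggregate)
```

直观上,any_successful 接近宽松的 pass@k 口径(k 次中至少一次成功),all_successful 则接近严格的 pass^k 口径(k 次全部成功);consistency 则衡量多次运行结果是否一致。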

5.4 防止数据污染

数据污染是 Benchmark 面临的重大挑战。以下是应对策略:

┌─────────────────────────────────────────────────────────────────┐
│                    数据污染防护策略                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 1. 私有测试集 (Private Test Set)                         │   │
│  │    ┌────────────────────────────────────────────────┐   │   │
│  │    │  公开集 (30%) │ 验证集 (20%) │ 私有集 (50%)    │   │   │
│  │    └────────────────────────────────────────────────┘   │   │
│  │    • 公开集用于开发和调试                                │   │
│  │    • 私有集的答案从不公开                                │   │
│  │    • 只通过官方 API 提交评估                             │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 2. 动态任务生成 (Dynamic Task Generation)                │   │
│  │    • 实时生成任务实例                                     │   │
│  │    • 参数随机化                                           │   │
│  │    • 环境状态随机化                                       │   │
│  │    • 同一模板产生无限变体                                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 3. 定期更新 (Regular Updates)                            │   │
│  │    • 每季度更新部分任务                                   │   │
│  │    • 废弃旧版本                                           │   │
│  │    • 版本号追踪                                           │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 4. 污染检测 (Contamination Detection)                    │   │
│  │    • 监测异常高分                                         │   │
│  │    • 对比训练数据                                         │   │
│  │    • N-gram 重叠分析                                      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
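
上图中公开/验证/私有三段式划分,可以用一个简单示意来说明(这里假设共 100 个任务,比例为 30/20/50):

```python
import random

# 按 30% / 20% / 50% 把任务划分为公开集、验证集与私有集
tasks = [f"task_{i}" for i in range(100)]
rng = random.Random(0)  # 固定种子,划分可复现
rng.shuffle(tasks)

public  = tasks[:30]    # 公开集:用于开发和调试
dev     = tasks[30:50]  # 验证集:用于调参
private = tasks[50:]    # 私有集:答案从不公开,仅通过官方 API 评估

print(len(public), len(dev), len(private))  # 30 20 50
```

关键在于三个子集互不相交,且私有集的划分结果(以及种子)不对外公开。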
import hashlib
import json
import random
import time
from typing import Dict, List, Any, Set
from dataclasses import dataclass

@dataclass
class ContaminationReport:
    """污染检测报告"""
    is_contaminated: bool
    confidence: float
    evidence: List[str]
    recommendations: List[str]

class ContaminationDetector:
    """数据污染检测器"""
    
    def __init__(self, benchmark_tasks: List[Dict[str, Any]]):
        self.benchmark_tasks = benchmark_tasks
        self.task_hashes = self._compute_task_hashes()
        
    def _compute_task_hashes(self) -> Set[str]:
        """计算任务指纹"""
        hashes = set()
        for task in self.benchmark_tasks:
            # 对任务的核心内容计算哈希
            content = json.dumps(
                {k: v for k, v in task.items() 
                 if k in ['instruction', 'expected_output']},
                sort_keys=True
            )
            hash_val = hashlib.sha256(content.encode()).hexdigest()[:16]
            hashes.add(hash_val)
        return hashes
    
    def detect_training_overlap(
        self,
        training_data: List[str]
    ) -> ContaminationReport:
        """
        检测训练数据与测试集的重叠
        
        Args:
            training_data: 训练数据样本
        
        Returns:
            污染检测报告
        """
        evidence = []
        
        # 1. N-gram 重叠检测
        ngram_overlap = self._check_ngram_overlap(training_data)
        if ngram_overlap > 0.1:
            evidence.append(
                f"High n-gram overlap detected: {ngram_overlap:.2%}"
            )
        
        # 2. 精确匹配检测
        exact_matches = self._check_exact_matches(training_data)
        if exact_matches > 0:
            evidence.append(
                f"Found {exact_matches} exact task matches in training data"
            )
        
        # 3. 语义相似度检测
        semantic_similarity = self._check_semantic_similarity(training_data)
        if semantic_similarity > 0.8:
            evidence.append(
                f"High semantic similarity: {semantic_similarity:.2f}"
            )
        
        is_contaminated = len(evidence) > 0
        confidence = min(1.0, len(evidence) * 0.3 + 
                        ngram_overlap + 
                        (exact_matches > 0) * 0.5)
        
        recommendations = []
        if is_contaminated:
            recommendations = [
                "Consider using a different test set",
                "Apply data decontamination techniques",
                "Report contamination in evaluation results"
            ]
        
        return ContaminationReport(
            is_contaminated=is_contaminated,
            confidence=confidence,
            evidence=evidence,
            recommendations=recommendations
        )
    
    def _check_ngram_overlap(
        self, 
        training_data: List[str],
        n: int = 5
    ) -> float:
        """检测 N-gram 重叠"""
        # 从 benchmark 提取 n-grams
        benchmark_ngrams = set()
        for task in self.benchmark_tasks:
            text = task.get('instruction', '')
            words = text.lower().split()
            for i in range(len(words) - n + 1):
                ngram = tuple(words[i:i+n])
                benchmark_ngrams.add(ngram)
        
        # 检查训练数据中的重叠
        overlap_count = 0
        total_ngrams = 0
        
        for text in training_data:
            words = text.lower().split()
            for i in range(len(words) - n + 1):
                ngram = tuple(words[i:i+n])
                total_ngrams += 1
                if ngram in benchmark_ngrams:
                    overlap_count += 1
        
        return overlap_count / max(total_ngrams, 1)
    
    def _check_exact_matches(self, training_data: List[str]) -> int:
        """检测精确匹配"""
        training_hashes = set()
        for text in training_data:
            hash_val = hashlib.sha256(text.encode()).hexdigest()[:16]
            training_hashes.add(hash_val)
        
        return len(self.task_hashes & training_hashes)
    
    def _check_semantic_similarity(
        self, 
        training_data: List[str]
    ) -> float:
        """检测语义相似度"""
        # 简化实现,实际应使用 embedding 模型
        # 这里返回示例值
        return 0.0

class DynamicTaskGenerator:
    """动态任务生成器(防污染)"""
    
    def __init__(self, templates: List[TaskTemplate], secret_seed: str):
        self.templates = templates
        # 使用时间和密钥生成随机种子
        self.seed = int(hashlib.sha256(
            f"{secret_seed}_{time.time()}".encode()
        ).hexdigest(), 16) % (2**32)
        self.rng = random.Random(self.seed)
        
    def generate_instance(
        self, 
        template: TaskTemplate
    ) -> GeneratedTask:
        """生成无法预测的任务实例"""
        
        # 参数值由密钥派生的种子驱动(并非严格意义的密码学随机,
        # 但足以防止外部预先枚举任务实例)
        params = {}
        for spec in template.parameters:
            name = spec["name"]
            if spec["type"] == "choice":
                params[name] = self.rng.choice(spec["options"])
            elif spec["type"] == "range":
                params[name] = self.rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float_range":
                params[name] = self.rng.uniform(spec["min"], spec["max"])
        
        # 添加随机扰动
        instruction = template.pattern.format(**params)
        instruction = self._add_paraphrase_variation(instruction)
        
        return GeneratedTask(
            task_id=f"dyn_{self.rng.randint(0, 10**9)}",
            instruction=instruction,
            parameters=params,
            difficulty=template.difficulty,
            capabilities=template.capabilities,
            ground_truth=None,  # 动态计算
            evaluation_config={"type": template.evaluation_type}
        )
    
    def _add_paraphrase_variation(self, text: str) -> str:
        """添加措辞变化"""
        # 实际应使用 LLM 进行同义改写
        # 简化实现:添加随机前缀/后缀
        prefixes = [
            "Please ", "Could you ", "I need you to ", "Help me to "
        ]
        suffixes = [
            "", ".", " Thanks.", " Please be thorough."
        ]
        
        prefix = self.rng.choice(prefixes)
        suffix = self.rng.choice(suffixes)
        
        return f"{prefix}{text[0].lower()}{text[1:]}{suffix}"
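
_check_ngram_overlap 的核心逻辑可以用一个独立的小例子验证(这里取 n=3 便于演示,示例句子均为虚构):

```python
def ngram_overlap(benchmark_texts, training_texts, n=3):
    """训练数据中与 benchmark 重叠的 n-gram 占比"""
    bench = set()
    for text in benchmark_texts:
        w = text.lower().split()
        bench.update(tuple(w[i:i + n]) for i in range(len(w) - n + 1))

    hits = total = 0
    for text in training_texts:
        w = text.lower().split()
        for i in range(len(w) - n + 1):
            total += 1
            hits += tuple(w[i:i + n]) in bench
    return hits / max(total, 1)

bench = ["book a flight from NYC to LA for next Monday"]
clean = ["the weather in Paris is sunny today"]
leaked = ["please book a flight from NYC to LA now"]

print(ngram_overlap(bench, clean))   # 0.0:无重叠
print(ngram_overlap(bench, leaked))  # 5/7 ≈ 0.714:高度重叠,疑似污染
```

实践中 n 常取 5 或更大以降低误报,并结合精确匹配与语义相似度综合判断。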

6. 实战:构建自定义评估系统 🛠️

让我们将所有组件整合,构建一个完整的 Agent 评估系统:

"""
完整的 Agent 评估系统实现
"""

import asyncio
import json
import logging
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
from dataclasses import dataclass, field, asdict

# 配置日志
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class BenchmarkConfig:
    """Benchmark 配置"""
    name: str
    version: str
    environments: List[str]
    max_steps: int = 30
    timeout_seconds: int = 300
    num_runs: int = 3
    output_dir: str = "./results"

@dataclass
class AgentConfig:
    """Agent 配置"""
    name: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 4096
    system_prompt: Optional[str] = None

@dataclass
class BenchmarkResult:
    """Benchmark 结果"""
    benchmark_name: str
    benchmark_version: str
    agent_name: str
    agent_model: str
    timestamp: str
    metrics: Dict[str, float]
    task_results: List[Dict[str, Any]]
    config: Dict[str, Any]
    
class AgentEvaluationSystem:
    """Agent 评估系统"""
    
    def __init__(
        self,
        benchmark_config: BenchmarkConfig,
        output_dir: str = "./evaluation_results"
    ):
        self.benchmark_config = benchmark_config
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # 初始化组件
        self.task_generator = None
        self.environments = {}
        self.evaluators = {}
        
    def setup(self, templates: List[TaskTemplate]) -> None:
        """设置评估系统"""
        
        logger.info("Setting up evaluation system...")
        
        # 初始化任务生成器
        self.task_generator = TaskGenerator(templates)
        
        # 初始化环境(根据配置)
        for env_name in self.benchmark_config.environments:
            self.environments[env_name] = self._create_environment(env_name)
            self.evaluators[env_name] = self._create_evaluator(env_name)
        
        logger.info(f"Setup complete. {len(self.environments)} environments ready.")
    
    def _create_environment(self, env_name: str):
        """创建环境实例"""
        env_mapping = {
            "os": OSEnvironment,
            "db": DatabaseEnvironment,
            "web": WebArenaEnvironment
        }
        
        env_class = env_mapping.get(env_name)
        if env_class:
            return env_class({"max_steps": self.benchmark_config.max_steps})
        
        raise ValueError(f"Unknown environment: {env_name}")
    
    def _create_evaluator(self, env_name: str):
        """创建评估器实例"""
        # 简化实现,返回通用评估器
        return WebArenaEvaluator({})
    
    async def evaluate_agent(
        self,
        agent,
        agent_config: AgentConfig,
        tasks: Optional[List[GeneratedTask]] = None,
        environment_name: str = "web"
    ) -> BenchmarkResult:
        """
        评估 Agent
        
        Args:
            agent: Agent 实例
            agent_config: Agent 配置
            tasks: 任务列表(可选,不提供则自动生成)
            environment_name: 使用的环境
        
        Returns:
            评估结果
        """
        logger.info(f"Starting evaluation for {agent_config.name}...")
        
        # 生成任务(如果未提供)
        if tasks is None:
            tasks = self.task_generator.generate_balanced_dataset(
                tasks_per_difficulty=10
            )
        
        logger.info(f"Total tasks: {len(tasks)}")
        
        # 获取环境和评估器
        environment = self.environments.get(environment_name)
        evaluator = self.evaluators.get(environment_name)
        
        if not environment or not evaluator:
            raise ValueError(f"Environment {environment_name} not configured")
        
        # 创建评估流水线
        pipeline = EvaluationPipeline(EvaluationConfig(
            max_steps=self.benchmark_config.max_steps,
            timeout_seconds=self.benchmark_config.timeout_seconds,
            num_runs=self.benchmark_config.num_runs
        ))
        
        # 运行评估
        raw_results = await pipeline.run_evaluation(
            agent, environment, tasks, evaluator
        )
        
        # 计算综合指标
        metrics = self._compute_comprehensive_metrics(raw_results)
        
        # 构建结果
        result = BenchmarkResult(
            benchmark_name=self.benchmark_config.name,
            benchmark_version=self.benchmark_config.version,
            agent_name=agent_config.name,
            agent_model=agent_config.model,
            timestamp=datetime.now().isoformat(),
            metrics=metrics,
            task_results=raw_results["task_results"],
            config={
                "benchmark": asdict(self.benchmark_config),
                "agent": asdict(agent_config)
            }
        )
        
        # 保存结果
        self._save_results(result)
        
        return result
    
    def _compute_comprehensive_metrics(
        self, 
        raw_results: Dict[str, Any]
    ) -> Dict[str, float]:
        """计算综合指标"""
        
        summary = raw_results["summary"]
        
        metrics = {
            # 基础指标
            "success_rate": summary["overall_success_rate"],
            "consistency_rate": summary["consistency_rate"],
            
            # 按难度
            **{f"sr_{k}": v for k, v in summary["by_difficulty"].items()},
            
            # 按能力
            **{f"cap_{k}": v for k, v in summary["by_capability"].items()},
            
            # 效率指标
            "avg_steps": sum(
                r["aggregate"]["avg_steps"] 
                for r in raw_results["task_results"]
            ) / len(raw_results["task_results"]),
            
            "avg_time": sum(
                r["aggregate"]["avg_time"]
                for r in raw_results["task_results"]
            ) / len(raw_results["task_results"]),
        }
        
        return metrics
    
    def _save_results(self, result: BenchmarkResult) -> None:
        """保存评估结果"""
        
        filename = (
            f"{result.agent_name}_{result.benchmark_name}_"
            f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
        )
        
        filepath = self.output_dir / filename
        
        with open(filepath, 'w') as f:
            json.dump(asdict(result), f, indent=2, default=str)
        
        logger.info(f"Results saved to {filepath}")
    
    def generate_report(
        self, 
        results: List[BenchmarkResult]
    ) -> str:
        """
        生成评估报告
        
        Args:
            results: 多个 Agent 的评估结果
        
        Returns:
            Markdown 格式的报告
        """
        report = f"""# Agent Evaluation Report

**Benchmark:** {self.benchmark_config.name} v{self.benchmark_config.version}
**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Summary

| Agent | Model | Success Rate | Consistency | Avg Steps |
|-------|-------|-------------|-------------|-----------|
"""
        for r in results:
            report += (
                f"| {r.agent_name} | {r.agent_model} | "
                f"{r.metrics['success_rate']:.2%} | "
                f"{r.metrics['consistency_rate']:.2%} | "
                f"{r.metrics['avg_steps']:.1f} |\n"
            )
        
        report += """
## Performance by Difficulty

"""
        for r in results:
            report += f"### {r.agent_name}\n\n"
            for key, value in r.metrics.items():
                if key.startswith("sr_"):
                    difficulty = key[3:]
                    report += f"- **{difficulty}**: {value:.2%}\n"
            report += "\n"
        
        report += """
## Performance by Capability

"""
        for r in results:
            report += f"### {r.agent_name}\n\n"
            for key, value in r.metrics.items():
                if key.startswith("cap_"):
                    capability = key[4:]
                    report += f"- **{capability}**: {value:.2%}\n"
            report += "\n"
        
        report += """
## Recommendations

Based on the evaluation results:

1. **Strengths**: Identify which capabilities each agent excels at
2. **Weaknesses**: Focus improvement efforts on low-scoring areas
3. **Efficiency**: Consider the trade-off between accuracy and resource usage
4. **Reliability**: High consistency indicates more predictable behavior

---

*Report generated automatically by Agent Evaluation System*
"""
        return report


# 使用示例
async def main():
    """主函数示例"""
    
    # 1. 配置 Benchmark
    benchmark_config = BenchmarkConfig(
        name="CustomAgentBench",
        version="1.0.0",
        environments=["web"],
        max_steps=30,
        timeout_seconds=300,
        num_runs=3
    )
    
    # 2. 初始化评估系统
    eval_system = AgentEvaluationSystem(benchmark_config)
    eval_system.setup(example_templates)
    
    # 3. 配置要评估的 Agent
    agents_to_evaluate = [
        AgentConfig(
            name="GPT-4-Agent",
            model="gpt-4",
            temperature=0.0
        ),
        AgentConfig(
            name="Claude-Agent",
            model="claude-3-opus",
            temperature=0.0
        )
    ]
    
    # 4. 运行评估
    all_results = []
    for agent_config in agents_to_evaluate:
        # 创建 Agent 实例(根据配置)
        # agent = create_agent(agent_config)
        agent = None  # 占位符
        
        # result = await eval_system.evaluate_agent(agent, agent_config)
        # all_results.append(result)
    
    # 5. 生成报告
    # report = eval_system.generate_report(all_results)
    # print(report)
    
    print("Evaluation system setup complete!")

if __name__ == "__main__":
    asyncio.run(main())
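
main() 中的 create_agent 留作占位。从 EvaluationPipeline 的调用方式看,只要对象实现 async act(task, observation),即可接入评估流程。下面是一个假设的最小适配示意(EchoAgent、StopAction 均为演示用名称;真实实现应在 act 中调用 LLM 并解析出带 action_type 的动作,以便 _is_terminal_action 识别终止):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StopAction:
    """演示用终止动作(真实系统中应带有 action_type 字段)"""
    text: str = "done"

class EchoAgent:
    """演示用 Agent:不调用 LLM,直接返回终止动作"""
    async def act(self, task, observation):
        return StopAction(text=f"answer for {task.get('task_id', '?')}")

action = asyncio.run(EchoAgent().act({"task_id": "t1"}, "initial observation"))
print(action.text)  # answer for t1
```

这种"最小 Agent"也常被用作评估系统自身的冒烟测试:先确认流水线能跑通,再接入真实模型。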

7. 评估结果分析与解读 📊

7.1 结果可视化

下面的工具类基于 matplotlib 实现三类常用图表:能力雷达图、难度递进柱状图和效率前沿(帕累托)散点图:

import matplotlib.pyplot as plt
import numpy as np
from typing import Dict, List, Any

class EvaluationVisualizer:
    """评估结果可视化"""
    
    @staticmethod
    def plot_radar_chart(
        results: Dict[str, Dict[str, float]],
        capabilities: List[str],
        title: str = "Agent Capability Comparison"
    ) -> None:
        """
        绘制雷达图比较不同 Agent 的能力
        
        Args:
            results: {agent_name: {capability: score}}
            capabilities: 能力维度列表
            title: 图表标题
        """
        # 设置雷达图参数
        angles = np.linspace(0, 2 * np.pi, len(capabilities), endpoint=False)
        angles = np.concatenate((angles, [angles[0]]))
        
        fig, ax = plt.subplots(figsize=(10, 8), subplot_kw=dict(polar=True))
        
        colors = plt.cm.Set2(np.linspace(0, 1, len(results)))
        
        for (agent_name, scores), color in zip(results.items(), colors):
            values = [scores.get(cap, 0) for cap in capabilities]
            values = values + [values[0]]  # 闭合
            
            ax.plot(angles, values, 'o-', linewidth=2, label=agent_name, color=color)
            ax.fill(angles, values, alpha=0.25, color=color)
        
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(capabilities)
        ax.set_ylim(0, 1)
        ax.set_title(title, size=14, fontweight='bold')
        ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.0))
        
        plt.tight_layout()
        plt.savefig('radar_comparison.png', dpi=150, bbox_inches='tight')
        plt.show()
    
    @staticmethod
    def plot_difficulty_progression(
        results: Dict[str, Dict[str, float]],
        difficulties: List[str] = ["easy", "medium", "hard", "expert"]
    ) -> None:
        """
        绘制难度递进图
        
        Args:
            results: {agent_name: {difficulty: success_rate}}
            difficulties: 难度级别列表
        """
        fig, ax = plt.subplots(figsize=(10, 6))
        
        x = np.arange(len(difficulties))
        width = 0.8 / len(results)
        
        for i, (agent_name, scores) in enumerate(results.items()):
            values = [scores.get(d, 0) for d in difficulties]
            offset = (i - len(results) / 2 + 0.5) * width
            bars = ax.bar(x + offset, values, width, label=agent_name)
            
            # 添加数值标签
            for bar, val in zip(bars, values):
                ax.annotate(f'{val:.1%}',
                           xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                           xytext=(0, 3),
                           textcoords="offset points",
                           ha='center', va='bottom', fontsize=8)
        
        ax.set_xlabel('Difficulty Level')
        ax.set_ylabel('Success Rate')
        ax.set_title('Performance by Difficulty')
        ax.set_xticks(x)
        ax.set_xticklabels(difficulties)
        ax.set_ylim(0, 1.1)
        ax.legend()
        ax.grid(axis='y', alpha=0.3)
        
        plt.tight_layout()
        plt.savefig('difficulty_progression.png', dpi=150)
        plt.show()
    
    @staticmethod
    def plot_efficiency_frontier(
        results: List[Dict[str, Any]],
        x_metric: str = "cost",
        y_metric: str = "success_rate"
    ) -> None:
        """
        绘制效率前沿图(帕累托前沿)
        
        Args:
            results: 包含各指标的结果列表
            x_metric: X 轴指标(通常是成本)
            y_metric: Y 轴指标(通常是成功率)
        """
        fig, ax = plt.subplots(figsize=(10, 6))
        
        xs = [r[x_metric] for r in results]
        ys = [r[y_metric] for r in results]
        names = [r["agent_name"] for r in results]
        
        # 绘制散点
        ax.scatter(xs, ys, s=100, c=range(len(results)), 
                   cmap='viridis', alpha=0.7)
        
        # 添加标签
        for x, y, name in zip(xs, ys, names):
            ax.annotate(name, (x, y), xytext=(5, 5), 
                       textcoords='offset points', fontsize=9)
        
        # 计算并绘制帕累托前沿
        pareto_points = EvaluationVisualizer._compute_pareto_front(
            list(zip(xs, ys))
        )
        if pareto_points:
            pareto_x, pareto_y = zip(*sorted(pareto_points))
            ax.plot(pareto_x, pareto_y, 'r--', linewidth=2, 
                   label='Pareto Frontier', alpha=0.7)
        
        ax.set_xlabel(f'{x_metric.replace("_", " ").title()}')
        ax.set_ylabel(f'{y_metric.replace("_", " ").title()}')
        ax.set_title('Efficiency Frontier')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.savefig('efficiency_frontier.png', dpi=150)
        plt.show()
    
    @staticmethod
    def _compute_pareto_front(points: List[tuple]) -> List[tuple]:
        """计算帕累托前沿"""
        pareto = []
        for p1 in points:
            dominated = False
            for p2 in points:
                # 对于成本-性能,我们要低成本高性能
                if p2[0] < p1[0] and p2[1] > p1[1]:
                    dominated = True
                    break
            if not dominated:
                pareto.append(p1)
        return pareto
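帕累托支配的判定逻辑可以脱离绘图代码独立验证。下面是一个最小示例(Agent 名称与成本/成功率数值均为虚构):

```python
def pareto_front(points):
    # 保留不被任何其他点支配的点:
    # 若存在 q 成本更低(q[0] < p[0])且成功率更高(q[1] > p[1]),则 p 被支配
    return [p for p in points
            if not any(q[0] < p[0] and q[1] > p[1] for q in points)]

# 虚构的 (成本, 成功率) 数据
agents = {"A": (0.10, 0.62), "B": (0.25, 0.71), "C": (0.30, 0.68), "D": (0.55, 0.80)}
front = pareto_front(list(agents.values()))
# C 被 B 支配(成本更高、成功率更低),其余三个点都在前沿上
```

注意这里使用严格不等号:成本与成功率都相同的两个点互不支配,会同时保留在前沿上。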

7.2 统计显著性检验

from scipy import stats
import numpy as np
from typing import List, Tuple

class StatisticalAnalysis:
    """统计分析工具"""
    
    @staticmethod
    def paired_t_test(
        scores_a: List[float],
        scores_b: List[float],
        alpha: float = 0.05
    ) -> Tuple[float, float, bool]:
        """
        配对 t 检验
        
        Args:
            scores_a: Agent A 的分数列表
            scores_b: Agent B 的分数列表
            alpha: 显著性水平
        
        Returns:
            (t_statistic, p_value, is_significant)
        """
        t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
        is_significant = p_value < alpha
        
        return t_stat, p_value, is_significant
    
    @staticmethod
    def bootstrap_confidence_interval(
        scores: List[float],
        confidence: float = 0.95,
        n_bootstrap: int = 10000
    ) -> Tuple[float, float, float]:
        """
        Bootstrap 置信区间
        
        Args:
            scores: 分数列表
            confidence: 置信水平
            n_bootstrap: Bootstrap 采样次数
        
        Returns:
            (mean, lower_bound, upper_bound)
        """
        scores = np.array(scores)
        bootstrap_means = []
        
        for _ in range(n_bootstrap):
            sample = np.random.choice(scores, size=len(scores), replace=True)
            bootstrap_means.append(np.mean(sample))
        
        mean = np.mean(scores)
        lower = np.percentile(bootstrap_means, (1 - confidence) / 2 * 100)
        upper = np.percentile(bootstrap_means, (1 + confidence) / 2 * 100)
        
        return mean, lower, upper
    
    @staticmethod
    def effect_size_cohens_d(
        group_a: List[float],
        group_b: List[float]
    ) -> float:
        """
        计算 Cohen's d 效应量
        
        Args:
            group_a: 组 A 的分数
            group_b: 组 B 的分数
        
        Returns:
            Cohen's d 值
        """
        n_a, n_b = len(group_a), len(group_b)
        var_a, var_b = np.var(group_a, ddof=1), np.var(group_b, ddof=1)
        
        # 合并标准差
        pooled_std = np.sqrt(
            ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
        )
        
        return (np.mean(group_a) - np.mean(group_b)) / pooled_std
    
    @staticmethod
    def interpret_effect_size(d: float) -> str:
        """解释效应量"""
        d = abs(d)
        if d < 0.2:
            return "negligible"
        elif d < 0.5:
            return "small"
        elif d < 0.8:
            return "medium"
        else:
            return "large"

7.3 错误分析

from collections import Counter
from typing import Dict, List, Any

class ErrorAnalyzer:
    """错误分析器"""
    
    def __init__(self, results: List[Dict[str, Any]]):
        self.results = results
        self.failed_tasks = [r for r in results if not r.get("success")]
        
    def categorize_errors(self) -> Dict[str, int]:
        """
        对错误进行分类
        
        Returns:
            错误类别计数
        """
        error_categories = []
        
        for task in self.failed_tasks:
            category = self._determine_error_category(task)
            error_categories.append(category)
        
        return dict(Counter(error_categories))
    
    def _determine_error_category(self, task: Dict[str, Any]) -> str:
        """判断错误类别"""
        trajectory = task.get("trajectory", [])
        
        if not trajectory:
            return "no_action"
        
        # 检查是否超时
        if len(trajectory) >= task.get("max_steps", 30):
            return "timeout"
        
        # 检查是否有 API 错误
        if any(step.get("error") for step in trajectory):
            return "api_error"
        
        # 检查是否在循环
        actions = [str(step.get("action")) for step in trajectory]
        if len(actions) > 5 and len(set(actions[-5:])) < 2:
            return "loop_detected"
        
        # 检查是否答案错误
        if task.get("final_answer"):
            return "wrong_answer"
        
        return "incomplete"
    
    def get_common_failure_patterns(
        self, 
        top_k: int = 5
    ) -> List[Dict[str, Any]]:
        """
        获取常见失败模式
        
        Args:
            top_k: 返回前 k 个模式
        
        Returns:
            失败模式列表
        """
        patterns = []
        
        for task in self.failed_tasks:
            pattern = self._extract_failure_pattern(task)
            patterns.append(pattern)
        
        # 按相似度聚类
        pattern_groups = self._cluster_patterns(patterns)
        
        # 返回最常见的
        return sorted(
            pattern_groups,
            key=lambda x: x["count"],
            reverse=True
        )[:top_k]
    
    def _extract_failure_pattern(
        self, 
        task: Dict[str, Any]
    ) -> Dict[str, Any]:
        """提取失败模式"""
        return {
            "task_type": task.get("task_type"),
            "difficulty": task.get("difficulty"),
            "steps_taken": len(task.get("trajectory", [])),
            "error_category": self._determine_error_category(task),
            "last_actions": [
                str(step.get("action"))[:50] 
                for step in task.get("trajectory", [])[-3:]
            ]
        }
    
    def _cluster_patterns(
        self, 
        patterns: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """聚类相似模式"""
        # 简化实现:按错误类别分组
        groups = {}
        for p in patterns:
            key = p["error_category"]
            if key not in groups:
                groups[key] = {
                    "error_category": key,
                    "examples": [],
                    "count": 0
                }
            groups[key]["examples"].append(p)
            groups[key]["count"] += 1
        
        return list(groups.values())
    
    def generate_error_report(self) -> str:
        """生成错误分析报告"""
        
        error_counts = self.categorize_errors()
        failure_patterns = self.get_common_failure_patterns()
        
        report = """
# Error Analysis Report

## Error Distribution

| Error Category | Count | Percentage |
|---------------|-------|------------|
"""
        total_errors = sum(error_counts.values())
        for category, count in sorted(error_counts.items(), 
                                      key=lambda x: x[1], reverse=True):
            pct = count / total_errors * 100 if total_errors > 0 else 0
            report += f"| {category} | {count} | {pct:.1f}% |\n"
        
        report += """
## Common Failure Patterns

"""
        for i, pattern in enumerate(failure_patterns, 1):
            report += f"""### Pattern {i}: {pattern['error_category']}

- **Frequency**: {pattern['count']} occurrences
- **Example tasks**: {len(pattern['examples'])} cases

"""
        
        report += """
## Recommendations

Based on the error analysis:

1. **Most common error**: Focus on addressing the top error category
2. **Timeout issues**: Consider increasing step limits or optimizing prompts
3. **Loop detection**: Implement better action diversity mechanisms
4. **API errors**: Improve error handling and retry logic

"""
        return report
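错误分类的规则可以脱离类独立验证。下面用几条虚构轨迹演示各分支的触发条件:

```python
from collections import Counter

def categorize(task):
    # 与上文 _determine_error_category 相同的规则(独立可运行版本)
    trajectory = task.get("trajectory", [])
    if not trajectory:
        return "no_action"
    if len(trajectory) >= task.get("max_steps", 30):
        return "timeout"
    if any(step.get("error") for step in trajectory):
        return "api_error"
    actions = [str(step.get("action")) for step in trajectory]
    if len(actions) > 5 and len(set(actions[-5:])) < 2:
        return "loop_detected"
    if task.get("final_answer"):
        return "wrong_answer"
    return "incomplete"

tasks = [
    {"trajectory": []},                                             # no_action
    {"trajectory": [{"action": "click"}] * 30},                     # timeout
    {"trajectory": [{"action": "open", "error": "HTTP 404"}]},      # api_error
    {"trajectory": [{"action": "a"}] + [{"action": "scroll"}] * 5}, # loop_detected
]
counts = Counter(categorize(t) for t in tasks)
```

注意分支顺序本身就是一种优先级设定:先判超时、再判 API 错误、最后判循环,同一条轨迹只会落入一个类别。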

8. 未来展望与发展趋势 🔮

8.1 当前挑战

┌─────────────────────────────────────────────────────────────────┐
│                    Agent 评估的当前挑战                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 1. 评估成本高昂                                          │   │
│  │    • 真实环境运行消耗大量资源                             │   │
│  │    • LLM API 调用费用高                                  │   │
│  │    • 人工评估效率低                                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 2. 标准化困难                                            │   │
│  │    • 缺乏统一的评估框架                                   │   │
│  │    • 不同 Benchmark 指标不可比                           │   │
│  │    • 环境配置差异大                                       │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 3. 真实性与可控性平衡                                    │   │
│  │    • 模拟环境偏离真实场景                                 │   │
│  │    • 真实环境难以可复现                                   │   │
│  │    • 安全风险难以控制                                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ 4. 能力边界不明确                                        │   │
│  │    • 难以定义 "智能" 的边界                              │   │
│  │    • 新能力涌现难以预测                                   │   │
│  │    • 评估体系需要持续更新                                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

8.2 发展趋势

1. 动态自适应评估

class AdaptiveBenchmark:
    """自适应 Benchmark"""
    
    def __init__(self, difficulty_estimator):
        self.estimator = difficulty_estimator
        self.task_pool = {}
        
    def select_next_task(
        self,
        agent_performance: List[Dict[str, Any]]
    ) -> GeneratedTask:
        """
        根据 Agent 表现动态选择下一个任务
        
        类似计算机自适应测试 (CAT) 的思想
        """
        # 估计 Agent 当前能力水平
        estimated_ability = self._estimate_ability(agent_performance)
        
        # 选择最具信息量的任务(难度匹配)
        best_task = None
        max_info = float("-inf")  # 用负无穷作初值,保证信息量为 0 的任务也能被选中
        
        for task in self.task_pool.values():
            info = self._compute_information(task, estimated_ability)
            if info > max_info:
                max_info = info
                best_task = task
        
        return best_task
    
    def _estimate_ability(
        self, 
        performance: List[Dict[str, Any]]
    ) -> float:
        """使用 IRT 模型估计能力"""
        # Item Response Theory 实现
        pass
    
    def _compute_information(
        self, 
        task: GeneratedTask, 
        ability: float
    ) -> float:
        """计算任务的信息量"""
        # Fisher 信息
        pass
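上面的 `_compute_information` 在 2PL IRT 模型下有简单的闭式:答对概率 P(θ) = 1 / (1 + e^(-a(θ-b))),Fisher 信息 I(θ) = a²·P·(1-P)。下面是一个最小示意,其中区分度 a 与难度 b 均为虚构参数:

```python
import math

def p_correct(theta, a, b):
    # 2PL 模型:能力为 theta 的 Agent 做对难度 b、区分度 a 的任务的概率
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    # Fisher 信息在 P = 0.5(难度恰好匹配能力)时取最大值
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# 候选任务的难度 b 越接近当前能力估计 theta,信息量越大
items = [{"a": 1.0, "b": b} for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
theta = 0.3
best = max(items, key=lambda it: fisher_information(theta, it["a"], it["b"]))
# 自适应测试会优先选择 b=0.0 这类难度匹配的任务
```

这正是 CAT 的核心思想:太简单或太难的任务几乎提供不了关于能力水平的信息,应把预算花在难度匹配的任务上。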

2. 多模态评估

┌─────────────────────────────────────────────────────────────────┐
│                    多模态 Agent 评估框架                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   输入模态                      评估维度                         │
│   ─────────                     ─────────                       │
│   ┌─────────┐                   ┌─────────────────────────┐     │
│   │  文本   │─────────────────→│ 语言理解与生成            │     │
│   └─────────┘                   └─────────────────────────┘     │
│   ┌─────────┐                   ┌─────────────────────────┐     │
│   │  图像   │─────────────────→│ 视觉感知与推理            │     │
│   └─────────┘                   └─────────────────────────┘     │
│   ┌─────────┐                   ┌─────────────────────────┐     │
│   │  音频   │─────────────────→│ 听觉理解与响应            │     │
│   └─────────┘                   └─────────────────────────┘     │
│   ┌─────────┐                   ┌─────────────────────────┐     │
│   │  视频   │─────────────────→│ 时序动态理解              │     │
│   └─────────┘                   └─────────────────────────┘     │
│   ┌─────────┐                   ┌─────────────────────────┐     │
│   │ 3D/具身 │─────────────────→│ 空间推理与物理交互        │     │
│   └─────────┘                   └─────────────────────────┘     │
│                                                                 │
│                      ↓ 整合评估 ↓                               │
│                                                                 │
│              ┌─────────────────────────────┐                    │
│              │   跨模态融合与协调能力评估   │                    │
│              └─────────────────────────────┘                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

3. 持续学习评估

class ContinualLearningBenchmark:
    """持续学习评估"""
    
    def __init__(self):
        self.task_sequence = []
        self.performance_history = []
        
    def evaluate_forward_transfer(
        self,
        agent,
        new_task_results: List[Dict],
        baseline: float
    ) -> float:
        """
        评估前向迁移能力
        
        学习新任务是否受益于之前学到的知识
        """
        avg_performance = sum(
            r["score"] for r in new_task_results
        ) / len(new_task_results)
        
        return avg_performance - baseline
    
    def evaluate_backward_transfer(
        self,
        agent,
        old_task_id: str
    ) -> float:
        """
        评估后向迁移/遗忘
        
        学习新任务后,旧任务的表现是否下降
        """
        # 获取学习新任务前的旧任务表现
        old_performance = self._get_historical_performance(old_task_id)
        
        # 获取当前的旧任务表现
        current_performance = self._evaluate_task(agent, old_task_id)
        
        # 负值表示遗忘
        return current_performance - old_performance
    
    def compute_plasticity_stability_tradeoff(
        self,
        forward_transfers: List[float],
        backward_transfers: List[float]
    ) -> Dict[str, float]:
        """
        计算可塑性-稳定性权衡
        """
        plasticity = sum(forward_transfers) / len(forward_transfers)
        stability = sum(backward_transfers) / len(backward_transfers)
        
        return {
            "plasticity": plasticity,
            "stability": stability,
            "combined": (plasticity + stability) / 2
        }
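前向/后向迁移通常用一个表现矩阵来汇总(GEM 等持续学习文献中的常见记法,此处数值为虚构示例):R[i][j] 表示学完第 i 个任务后在第 j 个任务上的得分,BWT 为负即出现遗忘。

```python
def backward_transfer(R):
    # BWT = mean_{j < T-1} ( R[T-1][j] - R[j][j] ),负值表示灾难性遗忘
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# R[i][j]:学完任务 i 后在任务 j 上的得分(虚构数据,按学习顺序排列)
R = [
    [0.80, 0.00, 0.00],
    [0.70, 0.90, 0.00],
    [0.60, 0.85, 0.95],
]
bwt = backward_transfer(R)   # (0.60-0.80 + 0.85-0.90) / 2 = -0.125,出现遗忘
```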

4. 人机协作评估

class HumanAgentCollaborationBenchmark:
    """人机协作评估"""
    
    def evaluate_collaboration(
        self,
        agent,
        human_simulator,  # 模拟人类行为
        task: Dict[str, Any]
    ) -> Dict[str, float]:
        """
        评估人机协作效果
        """
        # 协作完成任务
        result = self._run_collaborative_task(
            agent, human_simulator, task
        )
        
        # 计算协作指标
        metrics = {
            # 任务完成质量
            "task_completion": result["completion_score"],
            
            # 沟通效率
            "communication_efficiency": self._compute_comm_efficiency(
                result["dialogue_turns"]
            ),
            
            # 角色分工合理性
            "role_distribution": self._compute_role_balance(
                result["agent_actions"],
                result["human_actions"]
            ),
            
            # 人类满意度(模拟)
            "human_satisfaction": human_simulator.get_satisfaction(),
            
            # 互补性
            "complementarity": self._compute_complementarity(
                agent, human_simulator, task
            )
        }
        
        return metrics
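其中"角色分工合理性"等指标目前并没有统一定义。一种简单的假设性定义是:双方动作数越均衡,得分越高(以下公式与示例数值均为示意):

```python
def role_balance(agent_actions: int, human_actions: int) -> float:
    # 假设性定义:双方动作数完全相等得 1.0,一方包办全部动作得 0.0
    total = agent_actions + human_actions
    if total == 0:
        return 0.0
    return 1.0 - abs(agent_actions - human_actions) / total

# 示例:Agent 执行 12 步、人类执行 8 步
score = role_balance(12, 8)   # 1 - 4/20 = 0.8
```

实际场景中"均衡"未必是目标,理想的分工取决于任务性质,因此这类指标通常需要结合任务完成质量一起解读。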

8.3 研究前沿

┌─────────────────────────────────────────────────────────────────┐
│                    Agent 评估研究前沿                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. 因果推理评估                                                 │
│     • 区分相关性与因果性                                         │
│     • 反事实推理能力                                             │
│     • 干预效果预测                                               │
│                                                                 │
│  2. 元认知评估                                                   │
│     • Agent 对自身能力的认知                                     │
│     • 不确定性量化                                               │
│     • 知道自己不知道什么                                         │
│                                                                 │
│  3. 社会智能评估                                                 │
│     • 多 Agent 协作                                              │
│     • 社会规范遵守                                               │
│     • 博弈与谈判能力                                             │
│                                                                 │
│  4. 创造力评估                                                   │
│     • 开放式问题解决                                             │
│     • 新颖解决方案生成                                           │
│     • 跨领域知识迁移                                             │
│                                                                 │
│  5. 鲁棒性评估                                                   │
│     • 对抗性输入抵御                                             │
│     • 分布外泛化                                                 │
│     • 长尾场景处理                                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

9. 总结 📝

本文深入探讨了 AI Agent 评估领域的核心技术和方法论,涵盖了从理论到实践的完整知识体系。

关键要点回顾

1. Agent 评估的特殊性

  • 不同于传统 NLP 任务的输入-输出评估
  • 需要考虑多步骤推理、动态环境、过程质量
  • 评估维度更加多元化

2. 主流 Benchmark

  • AgentBench:综合性评估,覆盖 8 种环境
  • ToolBench:专注工具调用能力,16000+ API
  • WebArena:真实网页交互,可复现环境

3. 评估指标体系

  • 任务完成度:成功率、部分完成分、进度率
  • 过程质量:推理连贯性、错误恢复率
  • 效率成本:步数效率、token 效率、时间效率
  • 安全可靠:安全分数、一致性、边界遵守

4. Benchmark 设计方法论

  • 五大原则:真实性、可复现性、区分度、防污染性、全面性
  • 任务构建:模板化、参数化、难度分级
  • 污染防护:私有测试集、动态生成、定期更新

5. 未来趋势

  • 自适应评估
  • 多模态能力
  • 持续学习
  • 人机协作
  • 创造力与元认知

实践建议

💡 对于 Agent 开发者

  1. 在多个 Benchmark 上测试,避免过拟合单一数据集
  2. 关注过程质量,不仅仅是最终结果
  3. 进行充分的错误分析,找出能力短板
  4. 考虑效率和成本,追求实用性

💡 对于 Benchmark 设计者

  1. 确保任务来源于真实需求
  2. 建立完善的污染防护机制
  3. 提供详细的评估指南和基线
  4. 定期更新以跟上技术发展

💡 对于研究者

  1. 开发新的评估维度(如创造力、社会智能)
  2. 探索更高效的评估方法
  3. 研究评估结果的可解释性
  4. 关注评估公平性和偏见问题

Agent 评估是一个快速发展的领域,随着 Agent 能力的不断增强,评估方法也需要持续演进。希望本文能为读者提供一个全面的知识框架,助力 Agent 技术的健康发展。


参考文献

[1] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., … & Tang, J. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688.

[2] Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., … & Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint arXiv:2307.16789.

[3] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., … & Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854.

[4] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629.

[5] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint arXiv:2303.11366.

[6] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770.

[7] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., … & Scialom, T. (2023). Augmented Language Models: a Survey. arXiv preprint arXiv:2302.07842.

[8] Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., … & Wang, Y. (2023). A Survey on Large Language Model based Autonomous Agents. arXiv preprint arXiv:2308.11432.

[9] Ruan, J., Chen, Y., Zhang, B., Xu, Z., Bao, T., Du, G., … & Guo, H. (2023). TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents. arXiv preprint arXiv:2308.03427.

[10] Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., … & Su, Y. (2023). Mind2Web: Towards a Generalist Agent for the Web. arXiv preprint arXiv:2306.06070.

[11] Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., … & Yu, T. (2023). OpenAgents: An Open Platform for Language Agents in the Wild. arXiv preprint arXiv:2310.10634.

[12] Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334.

[13] Yang, Z., Liu, J., Han, J., Chen, X., Ding, Z., & Xie, J. (2023). API-Bank: A Benchmark for Tool-Augmented LLMs. arXiv preprint arXiv:2304.08244.

[14] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., & Zhuang, Y. (2023). HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint arXiv:2303.17580.

[15] Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. arXiv preprint arXiv:2304.03442.

[16] Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., … & Wang, C. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv preprint arXiv:2308.08155.

[17] Hong, S., Zheng, X., Chen, J., Cheng, Y., Wang, J., Zhang, C., … & Wu, Y. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352.

[18] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., … & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv preprint arXiv:2302.04761.

[19] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., … & Schulman, J. (2021). WebGPT: Browser-assisted Question-answering with Human Feedback. arXiv preprint arXiv:2112.09332.

[20] Zhuang, Y., Yu, Y., Wang, K., Sun, H., & Zhang, C. (2023). ToolQA: A Dataset for LLM Question Answering with External Tools. arXiv preprint arXiv:2306.13304.

[21] Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K. W., Wu, Y. N., … & Gao, J. (2023). Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. arXiv preprint arXiv:2304.09842.

[22] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601.

[23] Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajber, J., … & Hoefler, T. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv preprint arXiv:2308.09687.

[24] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 35, 24824-24837.

[25] Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. International Conference on Machine Learning, 9118-9147.

[26] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., … & Ichter, B. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv preprint arXiv:2204.01691.

[27] Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., … & Fox, D. (2020). ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10740-10749.

[28] Côté, M. A., Kádár, Á., Yuan, X., Kybartas, B., Barnes, T., Fine, E., … & Taschereau-Dumouchel, V. (2018). TextWorld: A Learning Environment for Text-based Games. Workshop on Computer Games, 41-75.

[29] Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., … & Anandkumar, A. (2022). MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. Advances in Neural Information Processing Systems, 35, 18343-18362.

[30] Huang, S., Papernot, N., Goodfellow, I., Duan, Y., & Abbeel, P. (2017). Adversarial Attacks on Neural Network Policies. arXiv preprint arXiv:1702.02284.

[31] Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., … & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

[32] Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., … & Hashimoto, T. B. (2023). AlpacaEval: An Automatic Evaluator of Instruction-following Models. GitHub repository.

[33] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. arXiv preprint arXiv:2212.10511.

[34] Lin, B. Y., Ravichander, A., Lu, X., Dziri, N., Sclar, M., Chandu, K., … & Choi, Y. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv preprint arXiv:2109.07958.

[35] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.

[36] OpenAI. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

[37] Anthropic. (2023). Claude 2 Technical Report. Anthropic Blog.

[38] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … & Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.

[39] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., … & Zhang, Y. (2023). Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv preprint arXiv:2303.12712.

[40] Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., … & Gui, T. (2023). The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint arXiv:2309.07864.


📌 本文持续更新中,欢迎关注获取最新内容!

如果觉得本文对你有帮助,请点赞、收藏、关注支持一下!

有任何问题或建议,欢迎在评论区留言交流!
