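Preface

LLM API services place very high demands on availability and performance: added latency or an outage directly affects user experience and business operations. This article covers Sentinel circuit breaking and degradation, high availability for Redis distributed caching, MySQL primary-replica replication with read/write splitting, API rate limiting with request queuing and merging, and asynchronous processing with thread pool tuning. Four accompanying architecture diagrams help readers build a highly available, high-performance LLM service architecture.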

────────────────────────────────────────────────────────────

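1. Sentinel Circuit Breaking and Degradation Configuration

1.1 Core Principles of the Circuit Breaker Pattern

The circuit breaker pattern is a key mechanism for preventing cascading failures (the "avalanche effect"). A circuit breaker has three states: Closed (requests pass through normally and failures are counted), Open (all requests fail fast and receive a degraded response), and Half-Open (a small number of probe requests are let through; if they succeed the breaker returns to Closed, otherwise it goes back to Open).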
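
The three states can be sketched as a small state machine. This is a minimal illustration, not Sentinel's implementation: the class name `SimpleCircuitBreaker`, the consecutive-failure threshold, and the injectable clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;

/** Minimal three-state circuit breaker; illustrative only, not Sentinel's implementation. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final long openMillis;       // how long to stay OPEN before probing
    private final LongSupplier clock;    // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may proceed; false means fail fast with a fallback. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;  // the open period has elapsed: admit one probe
            return true;
        }
        return state == State.CLOSED; // OPEN and HALF_OPEN reject further requests
    }

    synchronized void recordSuccess() {
        if (state == State.OPEN) {
            return;                   // ignore late results while the breaker is open
        }
        failures = 0;
        state = State.CLOSED;         // a successful probe closes the breaker
    }

    synchronized void recordFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;       // trip (or re-trip) and restart the open timer
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check `allowRequest()` before each downstream call, return a degraded response when it is false, and report the outcome through `recordSuccess()` or `recordFailure()`. Sentinel's own breakers work on slow-call ratio and exception ratio within a statistics window rather than on consecutive failures.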

1.2 Sentinel Quick Start

<dependency>
    <groupId>com.alibaba.csp</groupId>
    <artifactId>sentinel-core</artifactId>
    <version>1.8.6</version>
</dependency>
<dependency>
    <groupId>com.alibaba.csp</groupId>
    <artifactId>sentinel-transport-simple-http</artifactId>
    <version>1.8.6</version>
</dependency>

1.3 Circuit Breaking and Degradation Rule Configuration

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import java.util.Arrays;

public class SentinelDegradeConfig {

    public static void initDegradeRules() {
        // Slow-call rule: trips when too many calls exceed the response-time threshold
        DegradeRule degradeRule = new DegradeRule("llmApi")
                .setGrade(RuleConstant.DEGRADE_GRADE_RT)  // break on response time (RT)
                .setCount(2000)  // RT threshold: 2000 ms (2 s)
                .setSlowRatioThreshold(0.5)  // 50% slow-call ratio
                .setMinRequestAmount(5)  // minimum number of requests
                .setStatIntervalMs(60000)  // statistics window: 1 minute
                .setTimeWindow(30);  // open duration: 30 s

        // Exception-ratio rule: trips when too many calls end in an exception
        DegradeRule exceptionRule = new DegradeRule("llmApiException")
                .setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO)
                .setCount(0.3)  // 30% exception ratio
                .setMinRequestAmount(5)
                .setTimeWindow(60);  // open duration: 60 s

        DegradeRuleManager.loadRules(Arrays.asList(degradeRule, exceptionRule));
    }
}

1.4 Sentinel + Spring Boot Integration

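import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.annotation.SentinelResource;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class SentinelAspect {

    // Bind the annotation by parameter name so it can be injected into the advice
    @Around("@annotation(annotation)")
    public Object around(ProceedingJoinPoint pjp, SentinelResource annotation) throws Throwable {
        String resourceName = annotation.value();
        Entry entry = null;

        try {
            entry = SphU.entry(resourceName);
            return pjp.proceed();
        } catch (BlockException e) {
            // Rate limiting or circuit breaking was triggered: run the degradation logic
            return handleBlock(annotation.fallback());
        } catch (Throwable t) {
            // Record business exceptions so exception-based rules can count them
            Tracer.trace(t);
            throw t;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }

    private Object handleBlock(String fallbackName) {
        // Return a degraded response. This simplified aspect ignores fallbackName, and the
        // returned value must match the intercepted method's return type (String here).
        return "{\"error\":\"service_degraded\",\"message\":\"Service is temporarily busy, please try again later\"}";
    }
}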

@SentinelResource(value = "chatCompletion", fallback = "chatFallback")
public ChatResponse chat(ChatRequest request) {
    return llmService.chat(request);
}

public ChatResponse chatFallback(ChatRequest request, Throwable t) {
    // Degradation logic: return a cached result or a friendly message
    return ChatResponse.degraded("Service degraded, please try again later");
}

────────────────────────────────────────────────────────────

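2. Redis Distributed Cache High Availability

2.1 Redis Cluster Architecture

Redis Cluster shards data across 16,384 hash slots. In a 3-master, 3-replica deployment each master owns roughly 5,461 slots. Nodes communicate with one another over a gossip protocol, which enables automatic failover.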

                         ┌─────────────────────────┐
                         │      Redis Cluster      │
                         │ (3 masters, 3 replicas) │
                         └────────────┬────────────┘
                                      │
  ┌───────────┬───────────┬───────────┼───────────┬───────────┬───────────┐
  │ Master-1  │ Master-2  │ Master-3  │ Replica-1 │ Replica-2 │ Replica-3 │
  │ Slots     │ Slots     │ Slots     │ (sync)    │ (sync)    │ (sync)    │
  │ 0-5460    │ 5461-10922│10923-16383│           │           │           │
  └───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
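
How a key maps to a slot can be sketched as follows. Redis Cluster computes CRC16 of the key modulo 16384, and if the key contains a non-empty `{...}` hash tag, only the tag is hashed. The class and method names (`RedisSlot`, `slotFor`) are made up for this illustration; real clients perform this calculation internally.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative key-to-slot calculation for Redis Cluster. */
final class RedisSlot {
    static final int SLOT_COUNT = 16384;

    /** CRC16 (XMODEM variant): polynomial 0x1021, initial value 0, no bit reflection. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;  // keep the register at 16 bits
            }
        }
        return crc;
    }

    /** Returns the hash slot (0..16383) for a key, honoring {hash tags}. */
    static int slotFor(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                // Non-empty tag: hash only the text between the first '{' and the next '}'
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }
}
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot, which is what makes multi-key operations on them possible in a cluster.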

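2.2 Multi-Level Cache Design

A multi-level cache architecture is recommended for LLM workloads: L1 is a local cache built on Caffeine (the successor to Guava Cache) that keeps hot data in process for sub-millisecond access; L2 is a distributed cache on Redis Cluster that holds shared data accessible from every node; L3 is persistent storage in MySQL, which holds the final data.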

@Configuration
public class CacheConfig {

    @Bean
    public Cache<String, Object> localCache() {
        return Caffeine.newBuilder()
                .maximumSize(10000)
                .expireAfterWrite(1, TimeUnit.MINUTES)
                .recordStats()
                .build();
    }

    @Bean
    public RedisTemplate<String, Object> redisTemplate(RedisConnectionFactory factory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(factory);
        template.setKeySerializer(new StringRedisSerializer());
        template.setValueSerializer(new GenericJackson2JsonRedisSerializer());
        template.setHashKeySerializer(new StringRedisSerializer());
        return template;
    }
}

@Service
public class LlmCacheService {

    @Autowired
    private Cache<String, Object> localCache;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    private static final String CACHE_PREFIX = "llm:";

    public Object getResponse(String prompt) {
        String key = hashPrompt(prompt);

        // L1: local cache
        Object cached = localCache.getIfPresent(key);
        if (cached != null) {
            return cached;
        }

        // L2: Redis cache
        cached = redisTemplate.opsForValue().get(CACHE_PREFIX + key);
        if (cached != null) {
            localCache.put(key, cached);  // backfill L1
            return cached;
        }

        return null;
    }

    public void cacheResponse(String prompt, Object response) {
        String key = hashPrompt(prompt);
        redisTemplate.opsForValue().set(CACHE_PREFIX + key, response, 1, TimeUnit.HOURS);
        localCache.put(key, response);
    }
}

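2.3 Choosing a Caching Strategy

Strategy        How it works                                        Suitable scenario                     Consistency
Cache-Aside     Application drives reads and writes; cache aside    Read-heavy, write-light workloads     Eventual
Read-Through    Cache loads missing data automatically              Simplifying application logic         Eventual
Write-Through   Writes go to cache and store synchronously          High data-consistency requirements    Strong
Write-Behind    Writes are flushed to the store asynchronously      High write-performance requirements   Eventual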
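
As a concrete reference point, Cache-Aside can be sketched with two in-memory maps. This is a simplified illustration under stated assumptions: `CacheAsideRepository` is a hypothetical class, the `store` map stands in for the database, and a real deployment would add TTLs and handle concurrent writers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Cache-Aside sketch: the application, not the cache, talks to the backing store. */
final class CacheAsideRepository {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, String> store;  // hypothetical stand-in for the database
    private int storeReads = 0;               // counts reads that had to go to the store

    CacheAsideRepository(Map<String, String> store) {
        this.store = store;
    }

    String read(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;             // cache hit
        }
        storeReads++;
        value = store.get(key);       // cache miss: load from the store
        if (value != null) {
            cache.put(key, value);    // populate the cache for later reads
        }
        return value;
    }

    void write(String key, String value) {
        store.put(key, value);        // update the source of truth first
        cache.remove(key);            // then invalidate, so the next read reloads
    }

    int storeReads() {
        return storeReads;
    }
}
```

Invalidating on write instead of updating the cached value is a common choice because it avoids leaving a stale value behind when two writers race; the cost is one extra store read after each write.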

────────────────────────────────────────────────────────────

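3. MySQL Primary-Replica Replication and Read/Write Splitting

3.1 How Replication Works

MySQL replication is built on the binary log (binlog). The primary records every write in its binlog; on each replica, an I/O thread fetches the primary's binlog events and writes them to the relay log, and a SQL thread replays the relay log to apply the changes.

Replication modes compared: asynchronous replication (the primary returns as soon as it commits, without waiting for any replica to acknowledge); semi-synchronous replication (the primary waits until at least one replica acknowledges receipt of the transaction's binlog events before the commit returns to the client); Group Replication (MGR) (built on a Paxos-based consensus protocol, with single-primary and multi-primary modes).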

3.2 ShardingSphere Read/Write Splitting Configuration

spring:
  shardingsphere:
    datasource:
      ds-master:
        url: jdbc:mysql://192.168.1.10:3306/llm_db
        username: root
        password: ***
      ds-slave-0:
        url: jdbc:mysql://192.168.1.11:3306/llm_db
        username: root
        password: ***
      ds-slave-1:
        url: jdbc:mysql://192.168.1.12:3306/llm_db
        username: root
        password: ***
    rules:
      readwrite-splitting:
        data-sources:
          prd_ds:
            writeDataSourceName: ds-master
            readDataSourceNames:
              - ds-slave-0
              - ds-slave-1
            loadBalancerName: round_robin
        loadBalancers:
          round_robin:
            type: ROUND_ROBIN

3.3 Database and Table Sharding Design

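// Shard databases and tables by user_id.
// Note: @ShardingAlgorithm, @Property and ShardingTable are illustrative placeholders;
// with ShardingSphere-JDBC the same rule is normally supplied through a sharding
// algorithm implementation or a built-in algorithm configured in YAML.
@ShardingAlgorithm(value = ModShardingAlgorithm.class, props = {
    @Property(name = "sharding-count", value = "4")
})
public class UserOrderTable implements ShardingTable {

    private static final int SHARDING_COUNT = 4;  // keep in sync with "sharding-count"

    @Override
    public String doSharding(String targetTable, Collection<String> availableTargetTables,
                            ShardingValue<String> shardingValue) {
        String userId = shardingValue.getValue();
        // floorMod never yields a negative index, whereas Math.abs(hash) % n can
        // when hash is Integer.MIN_VALUE
        int tableIndex = Math.floorMod(userId.hashCode(), SHARDING_COUNT);
        return targetTable + "_" + tableIndex;
    }
}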

// Using ShardingSphere-JDBC: the statement and result set are closed automatically
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement(
         "SELECT * FROM chat_history WHERE user_id = ?")) {
    ps.setLong(1, userId);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // process each row
        }
    }
}
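
The modulo routing rule used for sharding can be isolated into a few lines of plain Java. This is a sketch for illustration only: `ShardRouter` and its methods are hypothetical names, not part of ShardingSphere.

```java
/** Hypothetical helper that maps a sharding key to a physical table name. */
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Index in [0, shardCount); floorMod keeps it non-negative for negative hash codes. */
    int indexFor(Object shardingKey) {
        return Math.floorMod(shardingKey.hashCode(), shardCount);
    }

    /** Physical table name, for example chat_history_2. */
    String tableFor(String logicTable, Object shardingKey) {
        return logicTable + "_" + indexFor(shardingKey);
    }
}
```

With four shards, `new ShardRouter(4).tableFor("chat_history", 10L)` routes to `chat_history_2`. Note that changing the shard count remaps most keys, so the count is usually fixed up front or replaced by a scheme designed for resharding.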

────────────────────────────────────────────────────────────

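4. API Rate Limiting and Request Queuing/Merging Strategies

4.1 Token Bucket Implementation

The token bucket algorithm is one of the most widely used approaches to API rate limiting, and it tolerates a bounded amount of burst traffic.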

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class TokenBucketRateLimiter {
    private final long capacity;      // bucket capacity
    private final long refillRate;   // tokens added per second
    private final AtomicLong tokens;
    private final AtomicLong lastRefillTime;
    private final ReentrantLock lock = new ReentrantLock();

    public TokenBucketRateLimiter(long capacity, long refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = new AtomicLong(capacity);
        this.lastRefillTime = new AtomicLong(System.nanoTime());
    }

    public boolean tryAcquire() {
        lock.lock();
        try {
            refill();
            if (tokens.get() > 0) {
                tokens.decrementAndGet();
                return true;
            }
            return false;
        } finally {
            lock.unlock();
        }
    }

    private void refill() {
        long now = System.nanoTime();
        long lastTime = lastRefillTime.get();
        long elapsed = now - lastTime;

        // Compute how many tokens should be added
        long tokensToAdd = (elapsed * refillRate) / 1_000_000_000L;
        if (tokensToAdd > 0) {
            long newTokens = Math.min(capacity, tokens.get() + tokensToAdd);
            tokens.set(newTokens);
            lastRefillTime.set(now);
        }
    }

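    public long availableTokens() {
        lock.lock();  // refill() mutates state, so take the same lock as tryAcquire()
        try {
            refill();
            return tokens.get();
        } finally {
            lock.unlock();
        }
    }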
}

4.2 Atomic Rate Limiting with Redis + Lua

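import java.util.Collections;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.data.redis.core.script.RedisScript;
import org.springframework.stereotype.Service;

@Service
public class RedisRateLimiter {

    @Autowired
    private StringRedisTemplate redisTemplate;

    // Fixed-window counter: INCR the key, set its TTL on the first hit,
    // and allow the request while the count stays within the limit
    private static final String RATE_LIMIT_LUA =
        "local key = KEYS[1] " +
        "local limit = tonumber(ARGV[1]) " +
        "local window = tonumber(ARGV[2]) " +
        "local current = redis.call('INCR', key) " +
        "if current == 1 then redis.call('EXPIRE', key, window) end " +
        "return current <= limit and 1 or 0";

    private static final RedisScript<Long> RATE_LIMIT_SCRIPT =
        new DefaultRedisScript<>(RATE_LIMIT_LUA, Long.class);

    public boolean isAllowed(String userId, int limit, int windowSeconds) {
        String key = "rate_limit:" + userId;
        // The script runs atomically on the Redis server
        Long result = redisTemplate.execute(
            RATE_LIMIT_SCRIPT,
            Collections.singletonList(key),
            String.valueOf(limit),
            String.valueOf(windowSeconds));
        return result != null && result == 1L;
    }
}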

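4.3 Request Merging Strategy

When several requests carry the same prompt, they can be merged into a single LLM call, which significantly reduces cost and, under load, latency.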

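import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Service;

@Service
public class PromptMergingService {

    private final ConcurrentHashMap<String, CompletableFuture<String>> pendingRequests = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public CompletableFuture<String> getResponse(String prompt) {
        String key = hashPrompt(prompt);

        // Atomically register a new request, or find an identical one already in progress
        CompletableFuture<String> future = new CompletableFuture<>();
        CompletableFuture<String> existing = pendingRequests.putIfAbsent(key, future);
        if (existing != null) {
            return existing;  // reuse the in-flight request
        }

        // Wait for a short merge window (for example 50 ms) before calling the LLM.
        // Note: the blocking call runs on the single scheduler thread; under real load
        // it should be handed off to a worker pool.
        scheduler.schedule(
            () -> executeAndComplete(key, prompt, future), 50, TimeUnit.MILLISECONDS);

        return future;
    }

    private void executeAndComplete(String key, String prompt, CompletableFuture<String> future) {
        try {
            // llmService is assumed to be an injected LLM client (declaration omitted)
            String response = llmService.chat(prompt);
            future.complete(response);
        } catch (Exception e) {
            future.completeExceptionally(e);
        } finally {
            // Remove only after completion so callers arriving mid-call still share the result
            pendingRequests.remove(key, future);
        }
    }

    private String hashPrompt(String prompt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(prompt.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            return prompt.hashCode() + "";
        }
    }
}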

────────────────────────────────────────────────────────────

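5. Asynchronous Processing and Thread Pool Optimization

5.1 Asynchronous Call Design

LLM API calls are typically I/O-bound, so making them asynchronous can significantly increase system throughput.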

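@Service
public class AsyncLLMService {

    private final Executor llmExecutor;

    public AsyncLLMService(@Qualifier("llmExecutor") Executor llmExecutor) {
        this.llmExecutor = llmExecutor;
    }

    // Run the blocking LLM call on the dedicated pool. The executor is passed explicitly
    // because supplyAsync() without one uses ForkJoinPool.commonPool(), and an @Async
    // annotation would be bypassed by the self-invocation in batchChatAsync().
    public CompletableFuture<ChatResponse> chatAsync(ChatRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                // llmService is assumed to be an injected LLM client (declaration omitted)
                return llmService.chat(request);
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        }, llmExecutor);
    }

    // Batch asynchronous calls: completes when every request has completed
    public CompletableFuture<List<ChatResponse>> batchChatAsync(List<ChatRequest> requests) {
        List<CompletableFuture<ChatResponse>> futures = requests.stream()
                .map(this::chatAsync)
                .collect(Collectors.toList());

        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(v -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }
}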

@Configuration
public class AsyncConfig {

    @Bean("llmExecutor")
    public Executor llmExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(200);
        executor.setKeepAliveSeconds(60);
        executor.setThreadNamePrefix("llm-async-");
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}

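5.2 Thread Pool Tuning Strategies

Guidelines for tuning the core parameters:

  • CPU-bound tasks: core threads = number of CPU cores + 1, to minimize context switching
  • I/O-bound tasks: core threads = 2 × number of CPU cores or higher, to make use of time spent waiting
  • LLM workloads: I/O-bound with long waits; start with core threads at 2 × number of CPU cores and keep the queue small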
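
A commonly cited sizing heuristic (from Java Concurrency in Practice) refines these rules of thumb: threads ≈ cores × target CPU utilization × (1 + wait time / compute time). The sketch below applies it; `PoolSizing` and its parameter names are chosen for this example, and the wait and compute times are values you would measure yourself.

```java
/** Thread-count heuristic: cores * utilization * (1 + wait / compute). */
final class PoolSizing {
    static int suggestedThreads(int cores, double targetUtilization,
                                double waitMillis, double computeMillis) {
        double raw = cores * targetUtilization * (1.0 + waitMillis / computeMillis);
        return Math.max(1, (int) Math.ceil(raw));  // never suggest fewer than one thread
    }
}
```

For example, 8 cores at 50% target utilization with 90 ms of waiting per 10 ms of computation gives 40 threads. LLM calls that wait for seconds push this number very high, so in practice the result is best treated as an upper bound and capped by downstream rate limits and memory, which is consistent with the conservative starting point suggested above.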

// Dynamic thread pool configuration (sizes are read from application properties)
@Configuration
public class DynamicThreadPoolConfig {

    @Value("${llm.thread.core:10}")
    private int corePoolSize;

    @Value("${llm.thread.max:50}")
    private int maxPoolSize;

    @Value("${llm.thread.queue:200}")
    private int queueCapacity;

    @Bean("llmTaskExecutor")
    public TaskExecutor llmTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(corePoolSize);
        executor.setMaxPoolSize(maxPoolSize);
        executor.setQueueCapacity(queueCapacity);
        executor.setThreadNamePrefix("llm-");
        executor.setWaitForTasksToCompleteOnShutdown(true);
        executor.setAwaitTerminationSeconds(60);
        // Rejection policy: run the task on the caller's thread
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}

5.3 Composing Calls with CompletableFuture

@Service
public class LLMCompositionService {

    public ChatResponse chatWithFallback(String prompt, List<String> models) {
        // Query all models concurrently, then return the first successful result in list order
        List<CompletableFuture<ChatResponse>> futures = models.stream()
                .map(model -> tryModel(prompt, model))
                .collect(Collectors.toList());

        // tryModel() already applies the timeout and maps failures to a fallback response
        return futures.stream()
                .map(CompletableFuture::join)
                .filter(r -> !r.isFallback())
                .findFirst()
                .orElse(ChatResponse.fallback());
    }

    private CompletableFuture<ChatResponse> tryModel(String prompt, String model) {
        return CompletableFuture.supplyAsync(() -> llmService.chat(prompt, model))
                .orTimeout(10, TimeUnit.SECONDS)
                .exceptionally(ex -> {
                    log.warn("Model {} failed: {}", model, ex.getMessage());
                    return ChatResponse.fallback();
                });
    }
}
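
When the goal is to return as soon as any model succeeds, joining the futures in list order still waits on earlier entries even if a later one has already finished. One possible alternative is a small helper that completes with whichever future succeeds first and fails only when all of them fail. `FirstSuccess` is a hypothetical name, and the sketch does not cancel the losing calls.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Completes with the first successful result; fails only if every future fails. */
final class FirstSuccess {
    static <T> CompletableFuture<T> of(List<CompletableFuture<T>> futures) {
        CompletableFuture<T> result = new CompletableFuture<>();
        if (futures.isEmpty()) {
            result.completeExceptionally(new IllegalArgumentException("no futures supplied"));
            return result;
        }
        AtomicInteger remainingFailures = new AtomicInteger(futures.size());
        for (CompletableFuture<T> f : futures) {
            f.whenComplete((value, error) -> {
                if (error == null) {
                    result.complete(value);  // first success wins; later ones are ignored
                } else if (remainingFailures.decrementAndGet() == 0) {
                    result.completeExceptionally(error);  // every candidate failed
                }
            });
        }
        return result;
    }
}
```

For this to work, the per-model futures should surface failures as exceptions rather than mapping them to a fallback value, with the fallback applied once on the combined future.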

────────────────────────────────────────────────────────────

6. High-Availability Architecture Summary

6.1 Overall Architecture Diagram

                          ┌─────────────────┐
                          │  User requests  │
                          └────────┬────────┘
                                   │
                          ┌────────▼────────┐
                          │   API Gateway   │
                          │ (rate limiting, │
                          │ authentication) │
                          └────────┬────────┘
                                   │
        ┌──────────────────────────┼──────────────────────────┐
        │                          │                          │
┌───────▼───────┐          ┌───────▼───────┐          ┌───────▼───────┐
│   Sentinel    │          │     Redis     │          │     MySQL     │
│   (circuit    │          │    Cluster    │          │  (read/write  │
│   breaking)   │          │(cache/session)│          │  splitting)   │
└───────────────┘          └───────┬───────┘          └───────────────┘
                                   │
                          ┌────────▼────────┐
                          │   LLM service   │
                          │     cluster     │
                          │  (multi-model   │
                          │    fallback)    │
                          └─────────────────┘

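6.2 Key Design Points

  • Circuit breaking and degradation: set sensible RT thresholds and exception ratios to keep failures from cascading.
  • Caching strategy: combine multi-level caching with appropriate expiration times to balance consistency and performance.
  • Read/write splitting: route reads to replicas and writes to the primary to relieve load on the primary.
  • Rate limiting and queuing: combine a token bucket with request merging to control concurrency and raise throughput.
  • Asynchronous processing: make full use of I/O wait time to improve resource utilization.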

────────────────────────────────────────────────────────────

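Summary

This article covered the core techniques and engineering practices for high availability and performance in LLM APIs along five dimensions: circuit breaking and degradation, cache high availability, database read/write splitting, rate limiting and queuing, and asynchronous optimization. With sound architecture and parameter tuning, readers can build LLM services that respond quickly and run reliably, giving users a smooth AI experience.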

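Appendix: Accompanying Technical Diagrams

Figure 1: Sentinel circuit breaking, degradation, and rate limiting architecture

Figure 2: Redis Cluster high availability and caching strategy architecture

Figure 3: MySQL primary-replica replication and read/write splitting architecture

Figure 4: Rate limiting algorithms and request merging strategy architecture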
