Java Programmer Stage 40, Part 08: Building a Complete Java LLM Project from Scratch, High-Availability Architecture and Performance Optimization

布吉岛的石头 · Published 2026-05-31 12:00:00

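Preface

LLM API services place very high demands on availability and performance: any added response latency or service interruption directly affects user experience and business operations. This article covers Sentinel circuit breaking and degradation, highly available distributed caching with Redis, MySQL primary-replica replication with read/write splitting, API rate limiting with request queuing and merging, and asynchronous execution with thread pool tuning. Four accompanying architecture diagrams are intended to help readers build a highly available, high-performance LLM service architecture.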

────────────────────────────────────────────────────────────

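1. Sentinel Circuit Breaking and Degradation Configuration

1.1 How the Circuit Breaker Pattern Works

The circuit breaker pattern is a key mechanism for preventing cascading failures (the "avalanche effect"). A circuit breaker has three states. Closed: requests pass through normally and failures are counted. Open: all requests fail fast and receive a degraded response. Half-Open: a small number of probe requests are let through; if they succeed the breaker returns to Closed, otherwise it goes back to Open.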
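
The three states described above can be sketched as a small state machine. This is a simplified illustration, not Sentinel's implementation: it trips on consecutive failures instead of a statistical window, and the class name, thresholds, and injectable clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;

/** Minimal three-state circuit breaker; names and thresholds are illustrative. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that trip the breaker
    private final long openMillis;      // how long to stay OPEN before probing
    private final LongSupplier clock;   // injectable millisecond clock, for testing

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may send the request. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;    // let exactly one probe request through
            return true;
        }
        return state == State.CLOSED;   // OPEN and HALF_OPEN reject everything else
    }

    /** A successful call (including a successful probe) closes the breaker. */
    synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    /** A failed probe, or too many consecutive failures, opens the breaker. */
    synchronized void recordFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check allowRequest() before the LLM call, then report the outcome with recordSuccess() or recordFailure(); when allowRequest() returns false it returns the degraded response straight away.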

1.2 Sentinel Quick Start

<dependency>
    <groupId>com.alibaba.csp</groupId>
    <artifactId>sentinel-core</artifactId>
    <version>1.8.6</version>
</dependency>
<dependency>
    <groupId>com.alibaba.csp</groupId>
    <artifactId>sentinel-transport-simple-http</artifactId>
    <version>1.8.6</version>
</dependency>

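1.3 Configuring Circuit Breaking Rules

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import java.util.Arrays;

public class SentinelDegradeConfig {

    // A rule only takes effect for the resource with exactly this name, so it must match
    // the name used in SphU.entry(...) / @SentinelResource ("chatCompletion" in section 1.4).
    private static final String RESOURCE = "chatCompletion";

    public static void initDegradeRules() {
        // Rule 1: break on the ratio of slow calls
        DegradeRule slowCallRule = new DegradeRule(RESOURCE)
                .setGrade(RuleConstant.DEGRADE_GRADE_RT) // strategy: slow-call ratio (by response time)
                .setCount(2000)                // a call slower than 2000 ms counts as slow
                .setSlowRatioThreshold(0.5)    // trip when 50% of calls are slow
                .setMinRequestAmount(5)        // minimum requests in the window before the rule applies
                .setStatIntervalMs(60000)      // statistics window: 1 minute
                .setTimeWindow(30);            // stay open for 30 seconds

        // Rule 2: break on the ratio of calls that throw exceptions
        DegradeRule exceptionRule = new DegradeRule(RESOURCE)
                .setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO)
                .setCount(0.3)                 // trip at a 30% exception ratio
                .setMinRequestAmount(5)
                .setTimeWindow(60);            // stay open for 60 seconds

        DegradeRuleManager.loadRules(Arrays.asList(slowCallRule, exceptionRule));
    }
}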

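1.4 Sentinel + Spring Boot Integration

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.annotation.SentinelResource;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// A hand-written aspect. Sentinel also ships SentinelResourceAspect (module
// sentinel-annotation-aspectj), which resolves fallback methods by name; register only one of the two.
@Aspect
@Component
public class SentinelAspect {

    // The pointcut binds the advice parameter, so it uses the parameter name, not the type name.
    @Around("@annotation(annotation)")
    public Object around(ProceedingJoinPoint pjp, SentinelResource annotation) throws Throwable {
        Entry entry = null;
        try {
            entry = SphU.entry(annotation.value());
            return pjp.proceed();
        } catch (BlockException e) {
            // Rate limiting or circuit breaking was triggered: run the degradation logic
            return handleBlock();
        } catch (Throwable t) {
            // Record business exceptions so that exception-ratio rules can count them
            Tracer.traceEntry(t, entry);
            throw t;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }

    private Object handleBlock() {
        // The degraded response must have the return type of the protected method
        return ChatResponse.degraded("The service is busy, please try again later");
    }
}

// Usage on a service method
@SentinelResource(value = "chatCompletion", fallback = "chatFallback")
public ChatResponse chat(ChatRequest request) {
    return llmService.chat(request);
}

// Fallback: same parameters as chat(), plus an optional trailing Throwable
public ChatResponse chatFallback(ChatRequest request, Throwable t) {
    // Degradation logic: return a cached result or a friendly message
    return ChatResponse.degraded("Service degraded, please try again later");
}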

────────────────────────────────────────────────────────────

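2. Highly Available Distributed Caching with Redis

2.1 Redis Cluster Architecture

Redis Cluster shards data across 16384 hash slots. In a 3-master, 3-replica deployment each master owns roughly 5,461 slots. Nodes exchange state over a Gossip protocol, and when a master fails one of its replicas is promoted automatically.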

                            ┌───────────────────────────┐
                            │       Redis Cluster       │
                            │  (3 masters, 3 replicas)  │
                            └─────────────┬─────────────┘
                                          │
┌─────────────┬─────────────┬─────────────┼─────────────┬─────────────┬─────────────┐
│ Master-1    │ Master-2    │ Master-3    │ Replica-1   │ Replica-2   │ Replica-3   │
│ Slots       │ Slots       │ Slots       │ (sync)      │ (sync)      │ (sync)      │
│ 0-5460      │ 5461-10922  │ 10923-16383 │             │             │             │
└─────────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
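
To make the slot mapping concrete, here is a minimal sketch of how a key is assigned to a slot: CRC16 of the key modulo 16384. The class and method names are illustrative, the master ranges assume the even 3-master split shown in the diagram, and hash tags (the `{...}` part of a key) are ignored for brevity.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of Redis Cluster key-to-slot mapping (hash tags not handled). */
final class RedisSlot {
    static final int SLOT_COUNT = 16384;

    /** CRC16, XMODEM variant: polynomial 0x1021, initial value 0. */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF; // keep the register at 16 bits
            }
        }
        return crc;
    }

    /** Slot that owns the given key. */
    static int slotOf(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    /** Master number (1..3) for a slot, assuming the even split from the diagram. */
    static int masterOf(int slot) {
        if (slot <= 5460) return 1;
        if (slot <= 10922) return 2;
        return 3;
    }
}
```

With this mapping, a client can compute the owning node locally and send the command there directly, instead of relying on a redirect from whichever node it happened to contact.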

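2.2 Multi-Level Cache Design

For LLM workloads a multi-level cache is recommended. L1, local cache: Caffeine (a successor to Guava Cache) holds hot data in process memory for sub-millisecond access. L2, distributed cache: Redis Cluster holds shared data that every application node can reach. L3, persistent storage: MySQL holds the data of record.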

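import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.serializer.GenericJackson2JsonRedisSerializer;
import org.springframework.data.redis.serializer.StringRedisSerializer;

@Configuration
public class CacheConfig {

    // L1: in-process cache, bounded in size and short-lived
    @Bean
    public Cache<String, Object> localCache() {
        return Caffeine.newBuilder()
                .maximumSize(10000)
                .expireAfterWrite(1, TimeUnit.MINUTES)
                .recordStats()
                .build();
    }

    // L2: Redis access with string keys and JSON-serialized values
    @Bean
    public RedisTemplate<String, Object> redisTemplate(RedisConnectionFactory factory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(factory);
        template.setKeySerializer(new StringRedisSerializer());
        template.setValueSerializer(new GenericJackson2JsonRedisSerializer());
        template.setHashKeySerializer(new StringRedisSerializer());
        return template;
    }
}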

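import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

@Service
public class LlmCacheService {

    @Autowired
    private Cache<String, Object> localCache;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    private static final String CACHE_PREFIX = "llm:";

    public Object getResponse(String prompt) {
        String key = hashPrompt(prompt);

        // L1: local cache
        Object cached = localCache.getIfPresent(key);
        if (cached != null) {
            return cached;
        }

        // L2: Redis cache
        cached = redisTemplate.opsForValue().get(CACHE_PREFIX + key);
        if (cached != null) {
            localCache.put(key, cached); // backfill L1
            return cached;
        }

        return null; // miss on both levels: the caller invokes the model, then calls cacheResponse
    }

    public void cacheResponse(String prompt, Object response) {
        String key = hashPrompt(prompt);
        redisTemplate.opsForValue().set(CACHE_PREFIX + key, response, 1, TimeUnit.HOURS);
        localCache.put(key, response);
    }

    // Same scheme as hashPrompt in section 4.3: truncated Base64 of the SHA-256 digest
    private String hashPrompt(String prompt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(prompt.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}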

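2.3 Choosing a Cache Strategy

Strategy	How it works	Best suited for	Consistency
Cache-Aside	The application reads and writes the cache itself, alongside the store	Read-heavy workloads	Eventual
Read-Through	The cache loads missing entries from the store automatically	Simpler application logic	Eventual
Write-Through	Writes go to the cache and the store synchronously	Strict consistency requirements	Strong
Write-Behind	Writes are flushed to the store asynchronously	High write throughput	Eventual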
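
As an illustration of the first row, the Cache-Aside read and write paths can be sketched with two in-memory maps. The class name is hypothetical and the `store` map stands in for the database; a production version would also need expiry and protection against concurrent reloads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Cache-Aside sketch: the application manages the cache next to the store. */
final class CacheAsideStore<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> store;                 // stand-in for the database
    private final AtomicInteger storeReads = new AtomicInteger();

    CacheAsideStore(Map<K, V> store) {
        this.store = store;
    }

    /** Read path: try the cache, fall back to the store, then populate the cache. */
    V get(K key) {
        V value = cache.get(key);
        if (value != null) {
            return value;                          // cache hit
        }
        storeReads.incrementAndGet();
        value = store.get(key);                    // cache miss: read the store
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    /** Write path: update the store first, then invalidate the cached entry. */
    void put(K key, V value) {
        store.put(key, value);
        cache.remove(key);
    }

    /** Number of reads that reached the store. */
    int storeReads() {
        return storeReads.get();
    }
}
```

Invalidating on write, instead of updating the cache in place, keeps the write path simple; the next read reloads the fresh value from the store.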

────────────────────────────────────────────────────────────

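3. MySQL Primary-Replica Replication and Read/Write Splitting

3.1 How Replication Works

MySQL replication is built on the binary log (Binlog). The primary records every write in its Binlog; on each replica an I/O thread reads the primary's Binlog and appends it to the Relay Log, and an SQL thread replays the Relay Log to apply the changes.

Replication modes compared. Asynchronous replication: the primary returns as soon as the transaction commits, without waiting for any replica. Semi-synchronous replication: the primary waits until at least one replica acknowledges receipt of the transaction's events before reporting the commit to the client. Group Replication (MGR): a Paxos-based protocol keeps the group consistent and supports both single-primary and multi-primary modes.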

3.2 ShardingSphere Read/Write Splitting Configuration

# Key names follow the original example; they differ between ShardingSphere versions,
# so check them against the version in use.
spring:
  shardingsphere:
    datasource:
      names: ds-master,ds-slave-0,ds-slave-1
      ds-master:
        url: jdbc:mysql://192.168.1.10:3306/llm_db
        username: root
        password: "***"   # placeholder
      ds-slave-0:
        url: jdbc:mysql://192.168.1.11:3306/llm_db
        username: root
        password: "***"   # placeholder
      ds-slave-1:
        url: jdbc:mysql://192.168.1.12:3306/llm_db
        username: root
        password: "***"   # placeholder
    rules:
      readwrite-splitting:
        data-sources:
          prd_ds:
            writeDataSourceName: ds-master
            readDataSourceNames:
              - ds-slave-0
              - ds-slave-1
            loadBalancerName: round_robin
        loadBalancers:
          round_robin:
            type: ROUND_ROBIN
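
What the ROUND_ROBIN rule above does at routing time can be sketched in a few lines. This is only an illustration with hypothetical names, not ShardingSphere's code: it treats any statement starting with SELECT as a read and ignores cases a real router has to handle, such as transactions and SELECT ... FOR UPDATE.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of read/write splitting with round-robin selection of read replicas. */
final class ReadWriteRouter {
    private final String writeDataSource;
    private final List<String> readDataSources;
    private final AtomicInteger counter = new AtomicInteger();

    ReadWriteRouter(String writeDataSource, List<String> readDataSources) {
        this.writeDataSource = writeDataSource;
        this.readDataSources = readDataSources;
    }

    /** Returns the name of the data source that should run the statement. */
    String route(String sql) {
        boolean isRead = sql.trim().regionMatches(true, 0, "SELECT", 0, 6);
        if (!isRead || readDataSources.isEmpty()) {
            return writeDataSource;                // writes always go to the primary
        }
        // floorMod keeps the index valid even after the counter overflows
        int index = Math.floorMod(counter.getAndIncrement(), readDataSources.size());
        return readDataSources.get(index);
    }
}
```

Because replication is asynchronous by default, a read routed to a replica immediately after a write may not see that write yet; requests that need read-your-writes behaviour are usually pinned to the primary.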

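3.3 Database and Table Sharding Design

// Shard by user_id. In ShardingSphere-JDBC this rule is normally declared in configuration
// (for example the built-in MOD algorithm with sharding-count = 4), not with annotations;
// the helper below only shows the routing arithmetic.
public final class UserIdTableSharding {

    private static final int SHARDING_COUNT = 4;

    // Maps a logical table and a user id to a physical table, e.g. chat_history_2
    public static String doSharding(String logicTable, long userId) {
        // floorMod never yields a negative index, unlike Math.abs(x) % n for x = Integer.MIN_VALUE
        int tableIndex = (int) Math.floorMod(userId, (long) SHARDING_COUNT);
        return logicTable + "_" + tableIndex;
    }
}

// Querying through ShardingSphere-JDBC: the SQL names the logical table,
// and the sharding rule picks the physical table from user_id
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement(
         "SELECT * FROM chat_history WHERE user_id = ?")) {
    ps.setLong(1, userId);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // read columns from rs here
        }
    }
}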

────────────────────────────────────────────────────────────

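4. API Rate Limiting, Queuing, and Request Merging

4.1 Token Bucket Implementation

The token bucket is one of the most widely used algorithms for API rate limiting. Because unused tokens accumulate up to the bucket capacity, it tolerates a bounded burst of traffic.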

import java.util.concurrent.locks.ReentrantLock;

public class TokenBucketRateLimiter {
    private final long capacity;     // bucket capacity
    private final long refillRate;   // tokens added per second
    private final ReentrantLock lock = new ReentrantLock();

    // Both fields are guarded by lock
    private long tokens;
    private long lastRefillTime;

    public TokenBucketRateLimiter(long capacity, long refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;                  // start with a full bucket
        this.lastRefillTime = System.nanoTime();
    }

    public boolean tryAcquire() {
        lock.lock();
        try {
            refill();
            if (tokens > 0) {
                tokens--;
                return true;
            }
            return false;
        } finally {
            lock.unlock();
        }
    }

    // Caller must hold lock
    private void refill() {
        long now = System.nanoTime();
        long elapsed = now - lastRefillTime;

        // Whole tokens earned since the last refill
        long tokensToAdd = (elapsed * refillRate) / 1_000_000_000L;
        if (tokensToAdd > 0) {
            tokens = Math.min(capacity, tokens + tokensToAdd);
            // Advance only by the time those tokens account for,
            // so the leftover fraction of a token is not lost
            lastRefillTime += (tokensToAdd * 1_000_000_000L) / refillRate;
        }
    }

    public long availableTokens() {
        lock.lock();   // refill() mutates state, so this read needs the lock too
        try {
            refill();
            return tokens;
        } finally {
            lock.unlock();
        }
    }
}

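4.2 Atomic Rate Limiting with Redis + Lua

import java.util.Collections;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.data.redis.core.script.RedisScript;
import org.springframework.stereotype.Service;

@Service
public class RedisRateLimiter {

    @Autowired
    private StringRedisTemplate redisTemplate;

    // Fixed-window counter: INCR and EXPIRE run atomically inside one script
    private static final RedisScript<Long> RATE_LIMIT_SCRIPT = new DefaultRedisScript<>(
        "local key = KEYS[1] " +
        "local limit = tonumber(ARGV[1]) " +
        "local window = tonumber(ARGV[2]) " +
        "local current = redis.call('INCR', key) " +
        "if current == 1 then redis.call('EXPIRE', key, window) end " +
        "return current <= limit and 1 or 0",
        Long.class);

    public boolean isAllowed(String userId, int limit, int windowSeconds) {
        String key = "rate_limit:" + userId;
        // execute(script, keys, args...) passes the key as KEYS[1] and the rest as ARGV
        Long result = redisTemplate.execute(
            RATE_LIMIT_SCRIPT,
            Collections.singletonList(key),
            String.valueOf(limit),
            String.valueOf(windowSeconds));
        return result != null && result == 1L;
    }
}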
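
The Lua script above implements a fixed-window counter. The same logic can be traced without Redis in the sketch below, where a map entry with an expiry time stands in for the Redis key and its TTL; the class name and the injectable clock are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** In-memory equivalent of the INCR + EXPIRE fixed-window script. */
final class FixedWindowLimiter {
    private static final class Window {
        long expiresAt;   // second at which the window (the Redis key) expires
        int count;        // requests counted in this window
    }

    private final Map<String, Window> windows = new HashMap<>();
    private final LongSupplier clockSeconds;   // injectable clock, in seconds

    FixedWindowLimiter(LongSupplier clockSeconds) {
        this.clockSeconds = clockSeconds;
    }

    synchronized boolean isAllowed(String key, int limit, int windowSeconds) {
        long now = clockSeconds.getAsLong();
        Window w = windows.get(key);
        if (w == null || now >= w.expiresAt) {
            // Key missing or expired: the first INCR creates it and EXPIRE starts the window
            w = new Window();
            w.expiresAt = now + windowSeconds;
            windows.put(key, w);
        }
        w.count++;                 // INCR
        return w.count <= limit;   // same comparison as the script's return value
    }
}
```

One property worth knowing: a client can spend a full quota at the end of one window and another full quota at the start of the next, so up to twice the limit can pass in a short span around the boundary. A sliding window or a token bucket smooths this out.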

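4.3 Request Merging Strategy

When several requests carry the same prompt, they can share a single LLM call, which cuts cost and spares later callers a full round trip.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Service
public class PromptMergingService {

    private final ConcurrentHashMap<String, CompletableFuture<String>> pendingRequests = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Blocking LLM calls run here so the single scheduler thread is never tied up;
    // the pool size is an illustrative value
    private final ExecutorService workers = Executors.newFixedThreadPool(16);

    @Autowired
    private LlmService llmService; // LlmService is an assumed type name for the LLM client

    public CompletableFuture<String> getResponse(String prompt) {
        String key = hashPrompt(prompt);

        // Register a new future, or join the one another caller already registered
        CompletableFuture<String> future = new CompletableFuture<>();
        CompletableFuture<String> existing = pendingRequests.putIfAbsent(key, future);
        if (existing != null) {
            return existing; // reuse the pending or in-flight request
        }

        // Drop the entry once the call finishes, so later requests start a fresh call
        future.whenComplete((response, error) -> pendingRequests.remove(key, future));

        // Merge window (50 ms here), then hand the blocking call to the worker pool
        scheduler.schedule(
            () -> workers.execute(() -> executeAndComplete(prompt, future)),
            50, TimeUnit.MILLISECONDS);

        return future;
    }

    private void executeAndComplete(String prompt, CompletableFuture<String> future) {
        try {
            String response = llmService.chat(prompt);
            future.complete(response);
        } catch (Exception e) {
            future.completeExceptionally(e);
        }
    }

    private String hashPrompt(String prompt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(prompt.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(hash).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}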

────────────────────────────────────────────────────────────

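5. Asynchronous Execution and Thread Pool Tuning

5.1 Asynchronous Call Design

Calls to an LLM API are typically I/O-bound, so making them asynchronous can raise system throughput considerably.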

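@Service
public class AsyncLLMService {

    private final LlmService llmService;   // LlmService is an assumed type name for the blocking LLM client
    private final Executor llmExecutor;

    public AsyncLLMService(LlmService llmService,
                           @Qualifier("llmExecutor") Executor llmExecutor) {
        this.llmService = llmService;
        this.llmExecutor = llmExecutor;
    }

    // Runs the blocking call on the dedicated pool. The executor is passed explicitly because
    // supplyAsync without one uses ForkJoinPool.commonPool(), and an @Async annotation would be
    // bypassed by the this::chatAsync self-invocation in batchChatAsync.
    public CompletableFuture<ChatResponse> chatAsync(ChatRequest request) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return llmService.chat(request);
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        }, llmExecutor);
    }

    // Batch asynchronous calls: completes when every request has completed
    public CompletableFuture<List<ChatResponse>> batchChatAsync(List<ChatRequest> requests) {
        List<CompletableFuture<ChatResponse>> futures = requests.stream()
                .map(this::chatAsync)
                .collect(Collectors.toList());

        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(v -> futures.stream()
                        .map(CompletableFuture::join) // does not block: all futures are done here
                        .collect(Collectors.toList()));
    }
}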

@Configuration
public class AsyncConfig {

    @Bean("llmExecutor")
    public Executor llmExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(200);
        executor.setKeepAliveSeconds(60);
        executor.setThreadNamePrefix("llm-async-");
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}

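5.2 Thread Pool Tuning Strategy

Guidelines for the core parameters:

CPU-bound tasks: core pool size = CPU cores + 1, to limit context switching
I/O-bound tasks: core pool size = CPU cores * 2 or higher, to make use of time spent waiting
LLM workloads: I/O-bound with long waits; start from a core pool size of 2 * CPU cores and keep the queue small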
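
The rules of thumb above can be related to a commonly cited sizing formula, threads = cores * (1 + wait time / compute time). The helper below is a sketch with hypothetical names; the wait and compute times are inputs you would measure, not values from this article.

```java
/** Thread pool sizing helpers based on the rules of thumb above. */
final class PoolSizing {
    private PoolSizing() {
    }

    /** CPU-bound work: one thread per core, plus one spare. */
    static int cpuBound(int cores) {
        return cores + 1;
    }

    /**
     * I/O-bound work: scale the core count by how long a task waits
     * compared with how long it computes.
     */
    static int ioBound(int cores, double waitMillis, double computeMillis) {
        return (int) Math.ceil(cores * (1 + waitMillis / computeMillis));
    }
}
```

With equal wait and compute time the formula gives exactly 2 * cores, which is where the "cores * 2" rule comes from. On 8 cores, a task that waits 900 ms on the model and computes for 100 ms yields ioBound(8, 900, 100) = 80 threads. In practice the upstream provider's rate limit and concurrency quota often cap the useful pool size well below that, so treat the result as an upper bound to test against, starting from Runtime.getRuntime().availableProcessors().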

// Configurable thread pool: sizes are read from application properties
@Configuration
public class DynamicThreadPoolConfig {

    @Value("${llm.thread.core:10}")
    private int corePoolSize;

    @Value("${llm.thread.max:50}")
    private int maxPoolSize;

    @Value("${llm.thread.queue:200}")
    private int queueCapacity;

    @Bean("llmTaskExecutor")
    public TaskExecutor llmTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(corePoolSize);
        executor.setMaxPoolSize(maxPoolSize);
        executor.setQueueCapacity(queueCapacity);
        executor.setThreadNamePrefix("llm-");
        executor.setWaitForTasksToCompleteOnShutdown(true);
        executor.setAwaitTerminationSeconds(60);
        // Rejection policy: run the task on the caller's thread
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}

5.3 Composing Calls with CompletableFuture

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@Service
public class LLMCompositionService {

    private static final Logger log = LoggerFactory.getLogger(LLMCompositionService.class);

    @Autowired
    private LlmService llmService; // LlmService is an assumed type name for the LLM client

    public ChatResponse chatWithFallback(String prompt, List<String> models) {
        // Start a call to every model in parallel
        List<CompletableFuture<ChatResponse>> futures = models.stream()
                .map(model -> tryModel(prompt, model))
                .collect(Collectors.toList());

        // Take results in list order: the earliest model in the list that succeeded wins.
        // Each future already has its own timeout and fallback (see tryModel), so join() does not throw.
        return futures.stream()
                .map(CompletableFuture::join)
                .filter(r -> !r.isFallback())
                .findFirst()
                .orElse(ChatResponse.fallback());
    }

    private CompletableFuture<ChatResponse> tryModel(String prompt, String model) {
        return CompletableFuture.supplyAsync(() -> llmService.chat(prompt, model))
                .orTimeout(10, TimeUnit.SECONDS) // orTimeout requires Java 9 or later
                .exceptionally(ex -> {
                    log.warn("Model {} failed: {}", model, ex.getMessage());
                    return ChatResponse.fallback();
                });
    }
}
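
If the goal is to return as soon as any model succeeds, regardless of list order, the futures can be raced against each other. The helper below is a sketch of that alternative with hypothetical names; it uses only the standard CompletableFuture API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Completes with the first successful result, or fails once every attempt has failed. */
final class FirstSuccess {
    private FirstSuccess() {
    }

    static <T> CompletableFuture<T> firstSuccessful(List<CompletableFuture<T>> futures) {
        CompletableFuture<T> result = new CompletableFuture<>();
        if (futures.isEmpty()) {
            result.completeExceptionally(new IllegalArgumentException("no futures given"));
            return result;
        }
        // Counts attempts that have not failed yet
        AtomicInteger remaining = new AtomicInteger(futures.size());
        for (CompletableFuture<T> future : futures) {
            future.whenComplete((value, error) -> {
                if (error == null) {
                    result.complete(value);              // first success wins, later ones are ignored
                } else if (remaining.decrementAndGet() == 0) {
                    result.completeExceptionally(error); // every attempt failed
                }
            });
        }
        return result;
    }
}
```

Racing trades cost for latency: every model is still invoked and billed, so this fits cases where tail latency matters more than spend. CompletableFuture.anyOf is not a drop-in replacement here, because it also completes when the first future fails.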

────────────────────────────────────────────────────────────

6. High-Availability Architecture Summary

6.1 Overall Architecture Diagram

                         ┌─────────────────┐
                         │  User requests  │
                         └────────┬────────┘
                                  │
                         ┌────────▼────────┐
                         │   API gateway   │
                         │(rate limit/auth)│
                         └────────┬────────┘
                                  │
        ┌─────────────────────────┼─────────────────────────┐
        │                         │                         │
┌───────▼───────┐         ┌───────▼───────┐         ┌───────▼───────┐
│   Sentinel    │         │ Redis Cluster │         │     MySQL     │
│(circuit break)│         │(cache/session)│         │  (R/W split)  │
└───────────────┘         └───────┬───────┘         └───────────────┘
                                  │
                       ┌──────────▼──────────┐
                       │ LLM service cluster │
                       │(multi-model backup) │
                       └─────────────────────┘

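6.2 Key Design Points

Circuit breaking and degradation: set sensible RT thresholds and exception ratios so that a failure does not cascade. Caching: combine multi-level caches with suitable expiry times to balance consistency against performance. Read/write splitting: route reads to replicas and writes to the primary to relieve the primary. Rate limiting and queuing: combine a token bucket with request merging to cap concurrency and raise throughput. Asynchronous execution: put I/O wait time to use and improve resource utilization.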

────────────────────────────────────────────────────────────

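Summary

This article covered the core techniques and engineering practices behind a highly available, high-performance LLM API from five angles: circuit breaking and degradation, highly available caching, database read/write splitting, rate limiting and queuing, and asynchronous optimization. With sound architecture and careful parameter tuning, readers can build an LLM service that responds quickly, stays stable, and gives users a smooth AI experience.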

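Appendix: Accompanying Diagrams

Figure 1: Sentinel circuit breaking, degradation, and rate limiting architecture

Figure 2: Redis Cluster high availability and caching strategy architecture

Figure 3: MySQL primary-replica replication and read/write splitting architecture

Figure 4: Rate limiting algorithms and request merging strategy architecture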
