Overview

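This article describes how to deploy and operate a Java LLM project in production. It covers Docker image packaging, Kubernetes container orchestration, GitOps continuous deployment with ArgoCD, log collection and querying with ELK, and distributed tracing with SkyWalking.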

1. Docker Image Packaging and Build

1.1 Multi-stage Dockerfile

# Stage 1: build
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline
COPY src ./src
RUN mvn clean package -DskipTests

# Stage 2: runtime
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=builder /app/target/*.jar app.jar
COPY config/ ./config/
EXPOSE 8080
# JVM options are placed before -jar, which is the conventional order
ENTRYPOINT ["java", "-Xms512m", "-Xmx2g", "-jar", "app.jar"]

1.2 Building the Model Image

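# The tag is an example; NVIDIA publishes CUDA images with full x.y.z versions
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
# The CUDA base image ships without a JVM, so install a headless JRE
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-17-jre-headless \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY model/ ./model/
COPY transformer.jar ./transformer.jar
ENV MODEL_PATH=/app/model
ENV JAVA_OPTS="-Xms4g -Xmx8g"
# Run through a shell so that $JAVA_OPTS is expanded; exec keeps java as PID 1
CMD ["sh", "-c", "exec java $JAVA_OPTS -jar transformer.jar"]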

1.3 Building and Pushing the Image

# Build the image
docker build -t llm-java-app:v1.0 .

# Tag it for the private registry
docker tag llm-java-app:v1.0 registry.example.com/llm-java-app:v1.0

# Push it to the private registry
docker push registry.example.com/llm-java-app:v1.0

2. Kubernetes Container Orchestration (K8s)

2.1 Deployment Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-java-app
  labels:
    app: llm-java-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-java-app
  template:
    metadata:
      labels:
        app: llm-java-app
    spec:
      containers:
      - name: app
        image: registry.example.com/llm-java-app:v1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        env:
        - name: SPRING_PROFILES_ACTIVE
          value: "prod"

2.2 Service and Ingress

apiVersion: v1
kind: Service
metadata:
  name: llm-java-service
spec:
  selector:
    app: llm-java-app
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
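  # No rewrite-target annotation here: combined with path "/" and no capture
  # group, ingress-nginx would rewrite every request URI to "/"
spec:
  # Assumes the ingress-nginx controller is installed under the class name "nginx"
  ingressClassName: nginx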
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llm-java-service
            port:
              number: 80

2.3 HPA Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-java-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-java-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
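
To make the effect of averageUtilization: 70 concrete, here is a minimal Java sketch of the replica calculation described in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within the default 10% tolerance, then clamped to minReplicas and maxReplicas. The class and method names are hypothetical, and the real controller also applies stabilization windows and pod-readiness rules that are left out here. Utilization is measured relative to the container's CPU request (500m in the Deployment above).

```java
public class HpaSketch {

    // Default tolerance of the HPA controller: ratios within 10% of 1.0 cause no scaling
    static final double TOLERANCE = 0.1;

    /**
     * Simplified version of the HPA replica calculation.
     * Utilization values are percentages of the requested CPU.
     */
    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        double ratio = currentUtilization / targetUtilization;
        int desired = currentReplicas;
        if (Math.abs(ratio - 1.0) > TOLERANCE) {
            desired = (int) Math.ceil(currentReplicas * ratio);
        }
        // Clamp to the bounds configured in the HPA spec
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // Values mirror the manifest above: target 70%, minReplicas 2, maxReplicas 10
        System.out.println(desiredReplicas(3, 90, 70, 2, 10));  // load above target
        System.out.println(desiredReplicas(3, 72, 70, 2, 10));  // within tolerance
        System.out.println(desiredReplicas(3, 20, 70, 2, 10));  // load well below target
        System.out.println(desiredReplicas(8, 140, 70, 2, 10)); // capped by maxReplicas
    }
}
```

With three replicas at 90% average CPU, the sketch asks for four replicas. At 20% it would drop to one, but minReplicas keeps it at two.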

3. ArgoCD GitOps Continuous Deployment

3.1 Installing ArgoCD

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Install the ArgoCD CLI (Homebrew; other platforms can use the release binaries)
brew install argocd

# Log in to ArgoCD; the server address is a positional argument
argocd login <argocd-server-address> --username admin --password <password>

3.2 Application Configuration

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-java-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/llm-k8s-manifests.git
    targetRevision: main
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: llm-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

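3.3 Automatic Image Updates

Argo CD Image Updater can update image tags automatically. In its annotation-based releases it is configured through annotations on the Application rather than through a source plugin, and it only applies to Kustomize- or Helm-based sources. The alias "llm" below is an arbitrary label:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-java-app
  namespace: argocd
  annotations:
    # <alias>=<image>; the alias is reused in the per-image annotations
    argocd-image-updater.argoproj.io/image-list: llm=registry.example.com/llm-java-app
    # Pick the highest semantic-version tag
    argocd-image-updater.argoproj.io/llm.update-strategy: semver
    # Commit the new tag back to the Git repository
    argocd-image-updater.argoproj.io/write-back-method: git

Registry endpoints and credentials (for example api_url: https://registry.example.com) belong in the registries.conf key of the argocd-image-updater-config ConfigMap, not in the Application.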

4. Log Collection and Querying (ELK)

4.1 Filebeat Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
        - add_kubernetes_metadata:
            host: ${NODE_NAME}
            matchers:
              - logs_path:
                  logs_path: /var/log/containers/
    output.logstash:
      hosts: ["logstash:5044"]

4.2 Logstash Pipeline Configuration

input {
  beats {
    port => 5044
  }
}

filter {
  json {
    source => "message"
  }
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  if [level] == "ERROR" {
    mutate {
      add_tag => ["error"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "llm-logs-%{+YYYY.MM.dd}"
  }
}
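
The pipeline above assumes that each log line is a JSON object carrying an ISO 8601 timestamp field and a level field. The following Java sketch shows what such a line could look like when produced by the application. The class name, the helper methods and the app_name field are illustrative assumptions. In a real project a JSON encoder for the logging framework would normally do this job instead of hand-written formatting.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class JsonLogLine {

    // Escapes the characters that must not appear raw inside a JSON string
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds one single-line JSON document with the fields the Logstash filter reads
    static String format(Instant timestamp, String level, String appName, String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", timestamp.toString()); // Instant.toString() is ISO 8601
        fields.put("level", level);
        fields.put("app_name", appName);
        fields.put("message", message);
        return fields.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        // Written to stdout, where the container runtime and Filebeat pick it up
        System.out.println(format(Instant.now(), "ERROR", "llm-java", "model call failed: \"timeout\""));
    }
}
```

Because the whole record sits on one line, Filebeat ships it as a single event, the json filter expands it into fields, and the date filter copies timestamp into @timestamp.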

4.3 Kibana Log Queries

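# These queries use Lucene syntax; switch the Kibana query bar from KQL to Lucene first

# Error logs
level:ERROR AND app_name:"llm-java"

# Requests whose response time exceeds the threshold (wildcards are not expanded inside quotes)
response_time:>1000 AND path:\/api\/llm\/*

# Activity of a specific user during the last hour
user_id:"user123" AND @timestamp:[now-1h TO now]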

5. SkyWalking Distributed Tracing

5.1 Java Agent Configuration

# Download the Java agent. Since 8.8.0 it is released separately from the
# OAP/UI distribution; 9.0.0 is an example version
wget https://archive.apache.org/dist/skywalking/java-agent/9.0.0/apache-skywalking-java-agent-9.0.0.tgz
tar -xzf apache-skywalking-java-agent-9.0.0.tgz

# JVM startup options
java -javaagent:/path/to/skywalking-agent/skywalking-agent.jar \
     -Dskywalking.agent.service_name=llm-java-app \
     -Dskywalking.collector.backend_service=oap:11800 \
     -jar llm-app.jar

5.2 Mounting the Agent in Kubernetes

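# Fragment of the llm-java-app Deployment from section 2.1
# (selector, labels, ports and resources omitted)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-java-app
spec:
  template:
    spec:
      initContainers:
      # Copies the agent into the shared volume, which would otherwise stay empty.
      # The image tag is an example.
      - name: skywalking-agent-init
        image: apache/skywalking-java-agent:9.0.0-alpine
        command: ["sh", "-c", "cp -R /skywalking/agent /agent/"]
        volumeMounts:
        - name: skywalking-agent
          mountPath: /agent
      containers:
      - name: app
        image: registry.example.com/llm-java-app:v1.0
        volumeMounts:
        - name: skywalking-agent
          mountPath: /agent
        env:
        # JAVA_TOOL_OPTIONS is read by the JVM itself, so the image's ENTRYPOINT need not expand it
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/agent/agent/skywalking-agent.jar"
        # Environment overrides read by the agent's default agent.config;
        # they replace the separate ConfigMap
        - name: SW_AGENT_NAME
          value: "llm-java-app"
        - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
          value: "oap:11800"
      volumes:
      - name: skywalking-agent
        emptyDir: {}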

5.3 Custom Tracing Annotations

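import org.apache.skywalking.apm.toolkit.trace.Tag;
import org.apache.skywalking.apm.toolkit.trace.Trace;
import org.springframework.stereotype.Service;

// Requires the org.apache.skywalking:apm-toolkit-trace dependency
@Service
public class LLMService {

    // LlmClient stands for the project's own model client, which this article does not show
    private final LlmClient llmClient;

    public LLMService(LlmClient llmClient) {
        this.llmClient = llmClient;
    }

    // Creates a local span around the LLM call
    @Trace
    public String generateResponse(String prompt) {
        return llmClient.invoke(prompt);
    }

    // Records the second argument (the model name) as the span tag "model"
    @Trace
    @Tag(key = "model", value = "arg[1]")
    public String callModel(String prompt, String model) {
        return llmClient.invoke(prompt, model);
    }
}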

Summary

This article covered the deployment and operations stack for a Java LLM project:

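Component  | Purpose                 | Key tools
Docker     | Containerized packaging | Dockerfile, BuildKit
Kubernetes | Container orchestration | Deployment, Service, HPA
ArgoCD     | GitOps deployment       | Application, Image Updater
ELK        | Log collection          | Filebeat, Logstash, Kibana
SkyWalking | Distributed tracing     | Java Agent, OAP, UI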

With this DevOps stack in place, LLM applications can be deployed automatically, observed in production, and iterated on continuously.
