14. How do you deploy a vertical-domain model in production? Core workflow and approach for multi-node, multi-GPU deployment on Kubernetes

Overview of Production Deployment on Kubernetes

Why Kubernetes?

As a container orchestration platform, Kubernetes provides the following core advantages for deploying large models in production:

Capability Comparison

| Dimension | Traditional deployment | Kubernetes deployment |
|---|---|---|
| Scalability | Manual scaling, slow to react | Autoscaling within seconds |
| High availability | Single points of failure | Automatic failure recovery |
| Resource management | Low resource utilization | Fine-grained resource scheduling |
| Operational complexity | Heavy manual operations | Declarative operations |
| Cost optimization | Significant resource waste | On-demand allocation, controllable cost |

Production Architecture Design

1.1 Overall Architecture

Multi-Cluster Architecture

[Architecture diagram] A global load balancer (Cloudflare / AWS ALB) routes traffic to production clusters in two regions. Each cluster runs an API gateway, LLM service pods, GPU nodes, CPU nodes, and storage. A shared monitoring and logging stack (Prometheus, Grafana, ELK Stack, Jaeger) observes both clusters.

Single-Cluster Internal Architecture

# cluster-architecture.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-architecture
  namespace: production
data:
  architecture.yaml: |
    cluster:
      name: "llm-production"
      region: "us-east-1"
      availability_zones: ["us-east-1a", "us-east-1b", "us-east-1c"]
    
    node_groups:
      gpu_nodes:
        instance_type: "p3.8xlarge"  # 4x V100 GPUs
        min_nodes: 3
        max_nodes: 12
        gpu_per_node: 4
        labels:
          node-type: "gpu"
          workload-type: "llm-inference"
      
      cpu_nodes:
        instance_type: "c5.4xlarge"
        min_nodes: 6
        max_nodes: 20
        labels:
          node-type: "cpu"
          workload-type: "api-gateway"
      
      system_nodes:
        instance_type: "t3.large"
        min_nodes: 3
        max_nodes: 6
        labels:
          node-type: "system"
          workload-type: "monitoring"
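
The node groups above only describe intent; they still have to be provisioned through the cloud provider's tooling. As a rough sketch (assuming EKS with eksctl; the names and sizes simply mirror the ConfigMap above), the GPU pool could be declared like this:

# eksctl-gpu-nodegroup.yaml (illustrative, assumes EKS + eksctl)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: llm-production
  region: us-east-1
managedNodeGroups:
- name: gpu-nodes
  instanceType: p3.8xlarge      # 4x V100 GPUs
  minSize: 3
  maxSize: 12
  desiredCapacity: 3
  labels:
    node-type: gpu
    workload-type: llm-inference
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

Tainting the GPU nodes keeps general workloads off them; any pod meant to land on these nodes then needs a matching toleration, like the one the device-plugin DaemonSet in the next subsection declares.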

1.2 GPU Resource Management

GPU Node Configuration

# gpu-node-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-node-config
  namespace: kube-system
data:
  nvidia-device-plugin.yml: |
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          containers:
          - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
            name: nvidia-device-plugin-ctr
            env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
          volumes:
          - name: device-plugin
            hostPath:
              path: /var/lib/kubelet/device-plugins
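
Once the device plugin DaemonSet is running, a throwaway pod can confirm that nvidia.com/gpu is actually allocatable on the GPU nodes (a minimal smoke test; the CUDA image tag is illustrative):

# gpu-smoke-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: production
spec:
  restartPolicy: Never
  nodeSelector:
    node-type: gpu
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"

If the pod completes and its logs list the expected GPUs, the driver and device plugin are wired up correctly.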

GPU Scheduling Policy

# gpu-scheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: gpu-scheduler
      plugins:
        filter:
          enabled:
          - name: NodeResourcesFit
        score:
          enabled:
          - name: NodeResourcesFit
          - name: PodTopologySpread
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
            - name: nvidia.com/gpu
              weight: 100
            - name: cpu
              weight: 10
            - name: memory
              weight: 10
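
This profile only takes effect if a scheduler instance actually runs with the configuration above (for example, a second kube-scheduler deployment mounting this ConfigMap). Workloads then opt in explicitly via schedulerName; a hypothetical batch worker might look like:

# gpu-batch-worker.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-batch-worker
  namespace: production
spec:
  schedulerName: gpu-scheduler   # use the GPU-aware profile defined above
  nodeSelector:
    node-type: gpu
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: worker
    image: your-registry/llm-batch-worker:v1.0.0   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"

The MostAllocated scoring strategy bin-packs GPU pods onto as few nodes as possible, which keeps whole nodes free for the cluster autoscaler to scale in.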

Core Component Deployment

2.1 LLM Service Deployment

StatefulSet Configuration

# llm-service-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-service
  namespace: production
spec:
  serviceName: llm-service
  replicas: 3
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values: ["gpu"]
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["llm-service"]
            topologyKey: kubernetes.io/hostname
      containers:
      - name: llm-service
        image: your-registry/llm-service:v1.0.0
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: metrics
        resources:
          requests:
            nvidia.com/gpu: "2"
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: "2"
            memory: "48Gi"
            cpu: "16"
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-70b-hf"
        - name: TENSOR_PARALLEL_SIZE
          value: "2"
        - name: GPU_MEMORY_UTILIZATION
          value: "0.9"
        - name: MAX_MODEL_LEN
          value: "4096"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        - name: config-volume
          mountPath: /config
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-model-pvc
      - name: config-volume
        configMap:
          name: llm-service-config

Service Configuration

# llm-service-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-service-config
  namespace: production
data:
  config.yaml: |
    model:
      name: "meta-llama/Llama-2-70b-hf"
      quantization: "awq"
      bits: 4
      group_size: 128
    
    serving:
      max_batch_size: 256
      max_num_seqs: 256
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.9
      swap_space: 4
    
    optimization:
      enable_prefix_caching: true
      enable_chunked_prefill: true
      max_num_batched_tokens: 4096
    
    logging:
      level: "INFO"
      format: "json"
    
    monitoring:
      enable_metrics: true
      metrics_port: 8001

Exposing the Service

# llm-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-service
  namespace: production
  labels:
    app: llm-service
spec:
  ports:
  - port: 80
    targetPort: 8000
    name: http
  - port: 9090
    targetPort: 8001
    name: metrics
  clusterIP: None
  selector:
    app: llm-service
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service-loadbalancer
  namespace: production
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
    name: http
  selector:
    app: llm-service
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
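
To keep voluntary disruptions (node drains, rolling cluster upgrades) from taking out too many replicas at once, a PodDisruptionBudget can accompany the service:

# llm-service-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-service-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-service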

2.2 API Gateway Deployment

Ingress Controller Configuration

# nginx-ingress-controller.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/component: controller
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/component: controller
    spec:
      containers:
      - name: controller
        image: registry.k8s.io/ingress-nginx/controller:v1.8.1
        args:
        - /nginx-ingress-controller
        - --configmap=$(POD_NAMESPACE)/nginx-configuration
        - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
        - --udp-services-configmap=$(POD_NAMESPACE)/udp-services
        - --annotations-prefix=nginx.ingress.kubernetes.io
        - --enable-metrics=true
        - --metrics-port=10254
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        ports:
        - name: http
          containerPort: 80
        - name: https
          containerPort: 443
        - name: metrics
          containerPort: 10254
        resources:
          requests:
            cpu: 100m
            memory: 90Mi
          limits:
            cpu: 500m
            memory: 512Mi

API Routing Configuration

# api-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-api-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rpm: "100"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.llm-production.com
    secretName: llm-api-tls
  rules:
  - host: api.llm-production.com
    http:
      paths:
      - path: /v1/generate
        pathType: Prefix
        backend:
          service:
            name: llm-service-loadbalancer
            port:
              number: 80
      - path: /v1/health
        pathType: Prefix
        backend:
          service:
            name: llm-service-loadbalancer
            port:
              number: 80

2.3 Load Balancing and Autoscaling

Horizontal Pod Autoscaler

# hpa-config.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: llm-service
  minReplicas: 3
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "50"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Max
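
The http_requests_per_second metric in the Pods rule is not built into Kubernetes; it has to be served through the custom metrics API, typically by prometheus-adapter. A sketch of the adapter rule, assuming prometheus-adapter is installed and the service exposes an http_requests_total counter (the counter name here is an assumption):

# prometheus-adapter-values.yaml (excerpt, assumes prometheus-adapter)
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace="production",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^http_requests_total$"
      as: "http_requests_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'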

Vertical Pod Autoscaler

# vpa-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: llm-service
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: llm-service
      maxAllowed:
        cpu: 32
        memory: 128Gi
      minAllowed:
        cpu: 4
        memory: 16Gi
      controlledResources: ["cpu", "memory"]

Storage and Data Management

3.1 Model Storage

Distributed Storage Configuration

# distributed-storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model-pv
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: distributed-fast
  nfs:
    server: nfs-server.production.svc.cluster.local
    path: "/models"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model-pvc
  namespace: production
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: distributed-fast
  resources:
    requests:
      storage: 200Gi
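
The shared volume still has to be populated with model weights before the StatefulSet starts. One way is a one-off download Job writing into the PVC; the sketch below assumes huggingface_hub is installed in the container and that the Hugging Face token lives in the llm-secrets Secret (the key name is an assumption):

# model-download-job.yaml (illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: llama2-70b-download
  namespace: production
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: downloader
        image: python:3.10-slim   # any image with huggingface_hub works
        command: ["/bin/sh", "-c"]
        args:
        - |
          pip install --quiet huggingface_hub
          python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-2-70b-hf', local_dir='/models/meta-llama/Llama-2-70b-hf')"
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: api-key   # assumes the HF token is stored under this key
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llm-model-pvc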

Model Version Management

# model_version_manager.py
import os
import json
import hashlib
from datetime import datetime
import boto3

class ModelVersionManager:
    """模型版本管理器"""
    
    def __init__(self, storage_backend="s3", bucket_name="llm-models"):
        self.storage_backend = storage_backend
        self.bucket_name = bucket_name
        
        if storage_backend == "s3":
            self.s3_client = boto3.client('s3')
    
    def upload_model(self, model_path, model_name, version):
        """上传模型"""
        # 计算模型哈希
        model_hash = self.calculate_file_hash(model_path)
        
        # Build the S3 key
        s3_key = f"models/{model_name}/v{version}/model.bin"
        
        # Upload the model file
        self.s3_client.upload_file(model_path, self.bucket_name, s3_key)
        
        # Create the version metadata
        version_metadata = {
            "model_name": model_name,
            "version": version,
            "hash": model_hash,
            "upload_time": datetime.now().isoformat(),
            "file_size": os.path.getsize(model_path),
            "metadata": self.extract_model_metadata(model_path)
        }
        
        # Save the metadata
        metadata_key = f"models/{model_name}/v{version}/metadata.json"
        self.s3_client.put_object(
            Bucket=self.bucket_name,
            Key=metadata_key,
            Body=json.dumps(version_metadata)
        )
        
        return version_metadata
    
    def promote_model(self, model_name, version, environment):
        """提升模型到指定环境"""
        # 获取模型元数据
        metadata = self.get_model_metadata(model_name, version)
        
        # Create the promotion record
        promotion_record = {
            "model_name": model_name,
            "version": version,
            "environment": environment,
            "promotion_time": datetime.now().isoformat(),
            "promoted_by": os.getenv("USER", "system"),
            "metadata": metadata
        }
        
        # Save the promotion record
        promotion_key = f"promotions/{environment}/{model_name}/v{version}.json"
        self.s3_client.put_object(
            Bucket=self.bucket_name,
            Key=promotion_key,
            Body=json.dumps(promotion_record)
        )
        
        return promotion_record
    
    def get_model_metadata(self, model_name, version):
        """获取模型元数据"""
        metadata_key = f"models/{model_name}/v{version}/metadata.json"
        
        try:
            response = self.s3_client.get_object(
                Bucket=self.bucket_name,
                Key=metadata_key
            )
            return json.loads(response['Body'].read())
        except self.s3_client.exceptions.NoSuchKey:
            return None
    
    def calculate_file_hash(self, file_path):
        """计算文件哈希"""
        sha256_hash = hashlib.sha256()
        with open(file_path, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()
    
    def extract_model_metadata(self, model_path):
        """提取模型元数据"""
        # 这里可以添加模型特定的元数据提取逻辑
        return {
            "framework": "pytorch",
            "quantization": "awq",
            "bits": 4
        }

3.2 Configuration Management

ConfigMap Management

# config-management.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-config-template
  namespace: production
data:
  model_config.yaml: |
    model:
      name: "{{ .Values.model.name }}"
      quantization: "{{ .Values.model.quantization }}"
      bits: {{ .Values.model.bits }}
    
    serving:
      max_batch_size: {{ .Values.serving.maxBatchSize }}
      max_num_seqs: {{ .Values.serving.maxNumSeqs }}
      tensor_parallel_size: {{ .Values.serving.tensorParallelSize }}
    
    optimization:
      enable_prefix_caching: {{ .Values.optimization.enablePrefixCaching }}
      enable_chunked_prefill: {{ .Values.optimization.enableChunkedPrefill }}
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-secrets
  namespace: production
type: Opaque
stringData:
  database-url: "postgresql://user:password@db.production.svc.cluster.local:5432/llm"
  api-key: "your-api-key-here"
  model-registry-token: "your-registry-token"

Configuration Hot Reloading

# config_hot_reloader.py
import os
import time
import yaml
from kubernetes import client, config, watch
import threading

class ConfigHotReloader:
    """配置热重载器"""
    
    def __init__(self, namespace="production"):
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()
        self.namespace = namespace
        self.watch_thread = None
        self.running = False
        
    def start_watching(self, config_map_name, callback):
        """开始监控配置变化"""
        self.running = True
        self.watch_thread = threading.Thread(
            target=self._watch_config_map,
            args=(config_map_name, callback)
        )
        self.watch_thread.daemon = True
        self.watch_thread.start()
        
    def stop_watching(self):
        """停止监控"""
        self.running = False
        if self.watch_thread:
            self.watch_thread.join()
    
    def _watch_config_map(self, config_map_name, callback):
        """监控ConfigMap变化"""
        w = client.Watch()
        
        for event in w.stream(
            self.v1.list_namespaced_config_map,
            namespace=self.namespace,
            field_selector=f"metadata.name={config_map_name}"
        ):
            if not self.running:
                break
                
            config_map = event['object']
            event_type = event['type']
            
            if event_type in ['ADDED', 'MODIFIED']:
                print(f"ConfigMap {config_map_name} {event_type}")
                
                # Extract the configuration data
                config_data = config_map.data
                
                # Invoke the callback with the new configuration
                try:
                    callback(config_data)
                except Exception as e:
                    print(f"Error processing config change: {e}")
        
        w.stop()
    
    def reload_configuration(self, config_data):
        """重载配置"""
        # 解析YAML配置
        try:
            new_config = yaml.safe_load(config_data.get('config.yaml', ''))
            
            # Validate the configuration format
            if self.validate_config(new_config):
                # Apply the new configuration
                self.apply_config(new_config)
                print("Configuration reloaded successfully")
            else:
                print("Invalid configuration format, keeping current config")
                
        except yaml.YAMLError as e:
            print(f"Error parsing configuration: {e}")
    
    def validate_config(self, config):
        """验证配置格式"""
        required_keys = ['model', 'serving', 'optimization']
        
        for key in required_keys:
            if key not in config:
                return False
        
        return True
    
    def apply_config(self, config):
        """应用新配置"""
        # 这里可以实现具体的配置应用逻辑
        # 例如:更新服务参数、重启组件等
        print(f"Applying new configuration: {config}")
        
        # Notify other components that the configuration has been updated
        self.notify_config_update(config)
    
    def notify_config_update(self, config):
        """通知配置更新"""
        # 可以通过消息队列、WebSocket等方式通知其他组件
        pass

# Usage example
def handle_config_change(config_data):
    """处理配置变化"""
    print("Configuration changed, reloading...")
    # 实现具体的配置重载逻辑
    pass

reloader = ConfigHotReloader()
reloader.start_watching("llm-service-config", handle_config_change)

Monitoring and Observability

4.1 Comprehensive Monitoring

Prometheus + Grafana Monitoring Stack

# monitoring-stack.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.45.0
        args:
        - '--config.file=/etc/prometheus/prometheus.yml'
        - '--storage.tsdb.path=/prometheus/'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.enable-lifecycle'
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheus
        - name: prometheus-storage
          mountPath: /prometheus
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 2
            memory: 4Gi
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage
        persistentVolumeClaim:
          claimName: prometheus-pvc
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    scrape_configs:
    - job_name: 'llm-service'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
        - production
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: llm-service
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      metrics_path: /metrics
      scrape_interval: 10s
    
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
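
Scraping is only half of the story; alerting rules turn metrics into pages. The rules below are illustrative (they use the custom llm_requests_total and gpu_utilization_percent metrics defined in the next subsection) and would be mounted next to prometheus.yml and referenced via rule_files:

# llm-alert-rules.yaml (illustrative; add to prometheus.yml via rule_files)
groups:
- name: llm-service-alerts
  rules:
  - alert: LLMHighErrorRate
    expr: |
      sum(rate(llm_requests_total{status="failed"}[5m]))
        / sum(rate(llm_requests_total[5m])) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "LLM request error rate above 5% for 5 minutes"
  - alert: GPUSaturated
    expr: avg by (node_name) (gpu_utilization_percent) > 95
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "GPU utilization on {{ $labels.node_name }} above 95% for 15 minutes"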

Custom Metrics

# custom_metrics.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import psutil
import GPUtil

class LLMMetricsCollector:
    """LLM自定义指标收集器"""
    
    def __init__(self, port=8001):
        # Business metrics
        self.request_count = Counter(
            'llm_requests_total',
            'Total number of LLM requests',
            ['model_name', 'status']
        )
        
        self.request_duration = Histogram(
            'llm_request_duration_seconds',
            'LLM request duration in seconds',
            ['model_name', 'batch_size']
        )
        
        self.tokens_generated = Counter(
            'llm_tokens_generated_total',
            'Total number of tokens generated',
            ['model_name']
        )
        
        # System metrics
        self.gpu_utilization = Gauge(
            'gpu_utilization_percent',
            'GPU utilization percentage',
            ['gpu_index', 'node_name']
        )
        
        self.gpu_memory_used = Gauge(
            'gpu_memory_used_bytes',
            'GPU memory used in bytes',
            ['gpu_index', 'node_name']
        )
        
        self.model_load_time = Histogram(
            'llm_model_load_duration_seconds',
            'Model loading duration in seconds',
            ['model_name']
        )
        
        # Start the metrics HTTP server
        start_http_server(port)
        
    def record_request(self, model_name, duration, batch_size=1, status="success"):
        """记录请求指标"""
        self.request_count.labels(model_name=model_name, status=status).inc()
        self.request_duration.labels(model_name=model_name, batch_size=batch_size).observe(duration)
    
    def record_tokens(self, model_name, token_count):
        """记录生成的token数"""
        self.tokens_generated.labels(model_name=model_name).inc(token_count)
    
    def update_gpu_metrics(self, node_name):
        """更新GPU指标"""
        try:
            gpus = GPUtil.getGPUs()
            for i, gpu in enumerate(gpus):
                self.gpu_utilization.labels(gpu_index=i, node_name=node_name).set(gpu.load * 100)
                self.gpu_memory_used.labels(gpu_index=i, node_name=node_name).set(gpu.memoryUsed * 1024 * 1024)
        except Exception as e:
            print(f"Error updating GPU metrics: {e}")
    
    def record_model_load(self, model_name, duration):
        """记录模型加载时间"""
        self.model_load_time.labels(model_name=model_name).observe(duration)

# Usage example
metrics_collector = LLMMetricsCollector()

# Inside the application
def handle_request(model_name, prompt):
    start_time = time.time()
    
    try:
        # Generate the response (generate_response is the application's own inference function)
        response = generate_response(model_name, prompt)
        
        # Record success metrics
        duration = time.time() - start_time
        metrics_collector.record_request(model_name, duration, status="success")
        metrics_collector.record_tokens(model_name, len(response.split()))
        
        return response
        
    except Exception as e:
        # Record failure metrics
        duration = time.time() - start_time
        metrics_collector.record_request(model_name, duration, status="failed")
        raise

4.2 Distributed Tracing

Jaeger Integration

# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.47
        ports:
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP
        - containerPort: 16686
          protocol: TCP
        - containerPort: 14268
          protocol: TCP
        - containerPort: 14250
          protocol: TCP
        - containerPort: 9411
          protocol: TCP
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
        - name: SPAN_STORAGE_TYPE
          value: "elasticsearch"
        - name: ES_SERVER_URLS
          value: "http://elasticsearch.monitoring.svc.cluster.local:9200"

Application-Side Tracing Integration

# tracing_integration.py
import time

from jaeger_client import Config
from opentracing.ext import tags
from opentracing.propagation import Format
import opentracing

def init_tracer(service_name='llm-service'):
    """初始化追踪器"""
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': 'jaeger-agent.monitoring.svc.cluster.local',
                'reporting_port': '6831',
            },
            'logging': True,
        },
        service_name=service_name,
        validate=True,
    )
    
    return config.initialize_tracer()

tracer = init_tracer()

def traced_function(func):
    """追踪装饰器"""
    def wrapper(*args, **kwargs):
        with tracer.start_active_span(func.__name__) as scope:
            try:
                result = func(*args, **kwargs)
                scope.span.set_tag(tags.HTTP_STATUS_CODE, 200)
                return result
            except Exception as e:
                scope.span.set_tag(tags.ERROR, True)
                scope.span.log_kv({'event': 'error', 'error.object': e})
                raise
    
    return wrapper

@traced_function
def generate_text_with_tracing(model_name, prompt, max_tokens=100):
    """带追踪的文本生成"""
    with tracer.start_active_span('llm_inference') as scope:
        scope.span.set_tag('model.name', model_name)
        scope.span.set_tag('prompt.length', len(prompt))
        scope.span.set_tag('max_tokens', max_tokens)
        
        # Record the start time
        start_time = time.time()
        
        try:
            # Run inference (generate_text is the application's own function)
            result = generate_text(model_name, prompt, max_tokens)
            
            # Record span tags
            duration = time.time() - start_time
            scope.span.set_tag('inference.duration', duration)
            scope.span.set_tag('output.length', len(result))
            scope.span.set_tag('tokens.generated', len(result.split()))
            
            return result
            
        except Exception as e:
            scope.span.set_tag('error', True)
            scope.span.log_kv({
                'event': 'inference_failed',
                'error.message': str(e),
                'error.type': type(e).__name__
            })
            raise

Security and Compliance

5.1 Network Security

Network Policy Configuration

# network-security.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-production-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: llm-service
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: production
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8000
    - protocol: TCP
      port: 8001  # metrics (container port)
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: production
    ports:
    - protocol: TCP
      port: 5432  # PostgreSQL
    - protocol: TCP
      port: 6379  # Redis
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: TCP
      port: 53    # DNS (CoreDNS)
    - protocol: UDP
      port: 53    # DNS (CoreDNS)
    - protocol: TCP
      port: 443   # Kubernetes API
    - protocol: TCP
      port: 6443  # Kubernetes API
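
The allow rules above only have teeth on top of a default-deny baseline for the namespace; without it, pods the policy does not select remain wide open. Once the baseline is in place, every workload in the namespace needs explicit allow rules, including DNS egress:

# default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress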

TLS Configuration

# tls-config.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@yourcompany.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
    - dns01:
        cloudflare:
          email: admin@yourcompany.com
          apiTokenSecretRef:
            name: cloudflare-api-token-secret
            key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: llm-api-cert
  namespace: production
spec:
  secretName: llm-api-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - api.llm-production.com
  - llm-api.yourcompany.com
  duration: 2160h # 90 days
  renewBefore: 720h # 30 days

5.2 RBAC Access Control

Role Definitions

# rbac-config.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: llm-service-reader
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["metrics.k8s.io"]
  resources: ["pods", "nodes"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: llm-service-operator
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers", "verticalpodautoscalers"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses", "networkpolicies"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: llm-service-operator-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: llm-service-operator
subjects:
- kind: ServiceAccount
  name: llm-service-operator
  namespace: production
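
A ClusterRole grants nothing until it is bound to a subject. The read-only role above could, for example, be bound to an on-call group (the llm-oncall group name is purely illustrative):

# reader-binding.yaml (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: llm-service-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: llm-service-reader
subjects:
- kind: Group
  name: llm-oncall
  apiGroup: rbac.authorization.k8s.io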

Service Account Configuration

# service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-service
  namespace: production
automountServiceAccountToken: false
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-service-operator
  namespace: production
---
apiVersion: v1
kind: Secret
metadata:
  name: llm-service-secret
  namespace: production
  annotations:
    kubernetes.io/service-account.name: llm-service
type: kubernetes.io/service-account-token

Deployment Automation and GitOps

6.1 ArgoCD Configuration

ArgoCD Application Definition

# argocd-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-production
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/yourcompany/llm-k8s-configs
    targetRevision: HEAD
    path: environments/production
    helm:
      valueFiles:
      - values-production.yaml
      parameters:
      - name: image.tag
        value: $ARGOCD_APP_REVISION
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  revisionHistoryLimit: 10
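
The Application references project: production, which must exist as an AppProject. A minimal sketch that pins the project to the config repository and the production namespace:

# argocd-project.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production LLM workloads
  sourceRepos:
  - https://github.com/yourcompany/llm-k8s-configs
  destinations:
  - server: https://kubernetes.default.svc
    namespace: production
  clusterResourceWhitelist:
  - group: '*'
    kind: '*'

The multi-region ApplicationSet below would additionally need its cluster API servers listed under destinations.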

ApplicationSet Management

# argocd-app-set.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: llm-multi-region
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - cluster: production-us-east-1
        region: us-east-1
        replicas: 6
      - cluster: production-us-west-2
        region: us-west-2
        replicas: 4
      - cluster: production-eu-west-1
        region: eu-west-1
        replicas: 4
  template:
    metadata:
      name: 'llm-{{region}}'
      namespace: argocd
      finalizers:
      - resources-finalizer.argocd.argoproj.io
    spec:
      project: production
      source:
        repoURL: https://github.com/yourcompany/llm-k8s-configs
        targetRevision: HEAD
        path: environments/production
        helm:
          valueFiles:
          - values-production.yaml
          parameters:
          - name: region
            value: '{{region}}'
          - name: replicas
            value: '{{replicas}}'
      destination:
        server: '{{cluster}}'
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true

6.2 Progressive Delivery

Flagger Configuration

# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: llm-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-service
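    # Note: Flagger canaries work on Deployments/DaemonSets; this assumes the LLM service is packaged as a Deployment rather than the StatefulSet used earlier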
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8000
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    hosts:
    - api.llm-production.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 30s
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://api.llm-production.com/v1/generate"
    - name: rollback-hook
      url: http://notification-service.production.svc.cluster.local/rollback
      timeout: 5s
      metadata:
        message: "Canary deployment failed, rolling back"
  skipAnalysis: false

Blue-Green Deployment Implementation

# blue_green_deploy.py
import copy
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

class BlueGreenDeployer:
    """Blue-green deployment manager."""
    
    def __init__(self, namespace="production"):
        config.load_incluster_config()
        self.apps_v1 = client.AppsV1Api()
        self.core_v1 = client.CoreV1Api()
        self.namespace = namespace
        
    def deploy_blue_green(self, deployment_name, new_image, timeout=600):
        """执行蓝绿部署"""
        
        # 获取当前部署(蓝色环境)
        blue_deployment = self.get_deployment(f"{deployment_name}-blue")
        if not blue_deployment:
            raise Exception("Blue deployment not found")
        
        # Create the green environment
        green_deployment = self.create_green_deployment(
            deployment_name, 
            blue_deployment, 
            new_image
        )
        
        # Wait for the green deployment to become ready
        self.wait_for_deployment_ready(f"{deployment_name}-green", timeout)
        
        # Health check
        if not self.health_check(f"{deployment_name}-green"):
            self.rollback_to_blue(deployment_name)
            raise Exception("Health check failed for green deployment")
        
        # Switch traffic to the green environment
        self.switch_traffic_to_green(deployment_name)
        
        # Observe the green environment before tearing down blue
        time.sleep(60)  # observation window
        
        # If everything looks good, delete the blue environment
        self.delete_deployment(f"{deployment_name}-blue")
        
        # Rename green to blue in preparation for the next deployment
        self.rename_deployment(f"{deployment_name}-green", f"{deployment_name}-blue")
        
        return True
    
    def create_green_deployment(self, base_name, blue_deployment, new_image):
        """Create the green deployment."""
        # Deep-copy the blue spec so the live object is not mutated
        green_spec = copy.deepcopy(blue_deployment.spec)

        # Relabel the selector and pod template so traffic can be switched on the "version" label
        green_spec.selector.match_labels = {"version": "green", "app": base_name}
        green_spec.template.metadata.labels = {"version": "green", "app": base_name}

        green_deployment = client.V1Deployment(
            metadata=client.V1ObjectMeta(
                name=f"{base_name}-green",
                namespace=self.namespace,
                labels={"version": "green", "app": base_name}
            ),
            spec=green_spec
        )

        # Update the container image
        for container in green_deployment.spec.template.spec.containers:
            if container.name == base_name:
                container.image = new_image

        # Create the deployment
        return self.apps_v1.create_namespaced_deployment(
            namespace=self.namespace,
            body=green_deployment
        )
    
    def health_check(self, deployment_name):
        """健康检查"""
        # 实现健康检查逻辑
        # 可以检查Pod状态、服务响应等
        return True
    
    def switch_traffic_to_green(self, base_name):
        """切换流量到绿色环境"""
        # 更新Service选择器
        service = self.core_v1.read_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace
        )
        
        service.spec.selector = {"version": "green", "app": base_name}
        
        self.core_v1.patch_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace,
            body=service
        )
    
    def rollback_to_blue(self, base_name):
        """回滚到蓝色环境"""
        # 切换流量回蓝色环境
        service = self.core_v1.read_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace
        )
        
        service.spec.selector = {"version": "blue", "app": base_name}
        
        self.core_v1.patch_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace,
            body=service
        )
        
        # Delete the green environment
        try:
            self.apps_v1.delete_namespaced_deployment(
                name=f"{base_name}-green",
                namespace=self.namespace
            )
        except ApiException:
            pass

    def get_deployment(self, deployment_name):
        """Read a deployment, returning None if it does not exist."""
        try:
            return self.apps_v1.read_namespaced_deployment(
                name=deployment_name,
                namespace=self.namespace
            )
        except ApiException:
            return None

    def wait_for_deployment_ready(self, deployment_name, timeout=600):
        """Block until all replicas of the deployment are ready."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            deployment = self.apps_v1.read_namespaced_deployment(
                name=deployment_name,
                namespace=self.namespace
            )
            desired = deployment.spec.replicas or 0
            ready = deployment.status.ready_replicas or 0
            if desired > 0 and ready == desired:
                return True
            time.sleep(10)
        raise Exception(f"Deployment {deployment_name} not ready within {timeout}s")

    def delete_deployment(self, deployment_name):
        """Delete a deployment."""
        self.apps_v1.delete_namespaced_deployment(
            name=deployment_name,
            namespace=self.namespace
        )

    def rename_deployment(self, old_name, new_name):
        """'Rename' a deployment by recreating it under a new name."""
        deployment = self.apps_v1.read_namespaced_deployment(
            name=old_name,
            namespace=self.namespace
        )
        deployment.metadata = client.V1ObjectMeta(
            name=new_name,
            namespace=self.namespace,
            labels=deployment.metadata.labels
        )
        # A full implementation would also reset the version labels and selector back to "blue"
        self.apps_v1.create_namespaced_deployment(
            namespace=self.namespace,
            body=deployment
        )
        self.delete_deployment(old_name)

Summary

A production Kubernetes deployment has to balance high availability, performance, security, and maintainability. With a sound architecture, a complete monitoring stack, strict security controls, and an automated deployment pipeline, you can build a stable, efficient, and scalable production environment for large models.

Key success factors:

  1. Sound resource planning: size GPU and CPU capacity against business demand
  2. A complete monitoring stack: track performance, resource usage, and business metrics in real time
  3. Strict security controls: network isolation, RBAC, and data encryption
  4. Automated operations: CI/CD, autoscaling, and automatic failure recovery
  5. Progressive delivery: reduce rollout risk and preserve business continuity

Following these practices helps ensure that a vertical-domain large model runs stably and efficiently in production.
