LLM-14: Domain-Specific LLMs — Production Model Deployment (Multi-Node, Multi-GPU Deployment on Kubernetes)
Core approach for deploying a domain-specific model on Kubernetes in production: Kubernetes provides autoscaling, high availability and fine-grained resource management for large-model serving. The overall architecture is multi-cluster, with a global load balancer distributing traffic. Key deployment elements: GPU resource management via the NVIDIA device plugin DaemonSet so GPUs become schedulable resources; node groups that separate GPU nodes (e.g. p3.8xlarge) from CPU nodes, distinguished by labels per workload type; and the LLM service itself deployed as a StatefulSet.
14. How should a domain-specific model be deployed in production? The core workflow and architecture for multi-node, multi-GPU deployment with Kubernetes
Overview of Production Deployment on Kubernetes
Why Kubernetes?
As a container orchestration platform, Kubernetes gives production deployments of large models the following core advantages:
Core capability comparison
| Capability | Traditional deployment | Kubernetes deployment |
| --- | --- | --- |
| Scalability | Manual scaling, slow to react | Autoscaling, second-level response |
| High availability | Single-point-of-failure risk | Automatic failure recovery |
| Resource management | Low resource utilization | Fine-grained resource scheduling |
| Operational complexity | Heavy manual operations | Declarative operations |
| Cost optimization | Significant resource waste | On-demand allocation, controllable cost |
Production Architecture Design
1.1 Overall Architecture
Multi-cluster architecture
Internal architecture of a single cluster
# cluster-architecture.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-architecture
namespace: production
data:
architecture.yaml: |
cluster:
name: "llm-production"
region: "us-east-1"
availability_zones: ["us-east-1a", "us-east-1b", "us-east-1c"]
node_groups:
gpu_nodes:
instance_type: "p3.8xlarge" # 4x V100 GPUs
min_nodes: 3
max_nodes: 12
gpu_per_node: 4
labels:
node-type: "gpu"
workload-type: "llm-inference"
cpu_nodes:
instance_type: "c5.4xlarge"
min_nodes: 6
max_nodes: 20
labels:
node-type: "cpu"
workload-type: "api-gateway"
system_nodes:
instance_type: "t3.large"
min_nodes: 3
max_nodes: 6
labels:
node-type: "system"
workload-type: "monitoring"
1.2 GPU Resource Management
GPU node configuration
# gpu-node-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: gpu-node-config
namespace: kube-system
data:
nvidia-device-plugin.yml: |
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: kube-system
spec:
selector:
matchLabels:
name: nvidia-device-plugin-ds
template:
metadata:
labels:
name: nvidia-device-plugin-ds
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
value: "false"
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
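Once the device-plugin DaemonSet is running, every GPU node should advertise nvidia.com/gpu under its allocatable resources. A quick sanity check, assuming the node-type=gpu label from the architecture ConfigMap:
# check_gpu_allocatable.py
# Sketch: confirm the NVIDIA device plugin is exposing GPUs to the scheduler.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node(label_selector="node-type=gpu").items:
    allocatable = node.status.allocatable or {}
    print(f"{node.metadata.name}: allocatable nvidia.com/gpu = {allocatable.get('nvidia.com/gpu', '0')}")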
GPU resource scheduling policy
# gpu-scheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: scheduler-config
namespace: kube-system
data:
config.yaml: |
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-scheduler
plugins:
filter:
enabled:
- name: NodeResourcesFit
score:
enabled:
- name: NodeResourcesFit
- name: PodTopologySpread
pluginConfig:
- name: NodeResourcesFit
args:
scoringStrategy:
type: MostAllocated
resources:
- name: nvidia.com/gpu
weight: 100
- name: cpu
weight: 10
- name: memory
weight: 10
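Pods opt into this profile by setting spec.schedulerName: gpu-scheduler. A throwaway smoke-test pod makes it easy to confirm that GPU requests are packed onto GPU nodes as intended; the CUDA image tag is illustrative.
# gpu_scheduler_smoke_test.py
# Sketch: schedule a one-off GPU pod through the custom "gpu-scheduler" profile.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-sched-test", "namespace": "production"},
    "spec": {
        "schedulerName": "gpu-scheduler",  # use the scheduler profile defined above
        "restartPolicy": "Never",
        "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}],
        "containers": [{
            "name": "cuda-test",
            "image": "nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04",  # illustrative tag
            "command": ["nvidia-smi"],
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }],
    },
}
core.create_namespaced_pod(namespace="production", body=pod)
print("test pod created; check its node placement with kubectl get pod gpu-sched-test -o wide")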
Core Component Deployment
2.1 LLM Service Deployment
StatefulSet configuration
# llm-service-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: llm-service
namespace: production
spec:
serviceName: llm-service
replicas: 3
selector:
matchLabels:
app: llm-service
template:
metadata:
labels:
app: llm-service
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values: ["gpu"]
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values: ["llm-service"]
topologyKey: kubernetes.io/hostname
containers:
- name: llm-service
image: your-registry/llm-service:v1.0.0
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: metrics
resources:
requests:
nvidia.com/gpu: "2"
memory: "32Gi"
cpu: "8"
limits:
nvidia.com/gpu: "2"
memory: "48Gi"
cpu: "16"
env:
- name: MODEL_NAME
value: "meta-llama/Llama-2-70b-hf"
- name: TENSOR_PARALLEL_SIZE
value: "2"
- name: GPU_MEMORY_UTILIZATION
value: "0.9"
- name: MAX_MODEL_LEN
value: "4096"
volumeMounts:
- name: model-storage
mountPath: /models
- name: config-volume
mountPath: /config
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: llm-model-pvc
- name: config-volume
configMap:
name: llm-service-config
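The MODEL_NAME, TENSOR_PARALLEL_SIZE, GPU_MEMORY_UTILIZATION and MAX_MODEL_LEN variables mirror vLLM's serving parameters, so a typical entrypoint for the llm-service image is a thin wrapper that forwards them to vLLM's OpenAI-compatible server. A sketch under that assumption (the image is not specified in this article, so treat this as one possible implementation):
# entrypoint.py
# Sketch of a container entrypoint: forward the StatefulSet's env vars to
# vLLM's OpenAI-compatible API server (assumes vLLM is installed in the image).
import os
import subprocess

cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", os.environ.get("MODEL_NAME", "meta-llama/Llama-2-70b-hf"),
    "--tensor-parallel-size", os.environ.get("TENSOR_PARALLEL_SIZE", "2"),
    "--gpu-memory-utilization", os.environ.get("GPU_MEMORY_UTILIZATION", "0.9"),
    "--max-model-len", os.environ.get("MAX_MODEL_LEN", "4096"),
    "--download-dir", "/models",  # matches the model-storage volume mount
    "--port", "8000",             # matches the http containerPort and the probes
]
subprocess.run(cmd, check=True)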
Serving configuration
# llm-service-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: llm-service-config
namespace: production
data:
config.yaml: |
model:
name: "meta-llama/Llama-2-70b-hf"
quantization: "awq"
bits: 4
group_size: 128
serving:
max_batch_size: 256
max_num_seqs: 256
tensor_parallel_size: 2
gpu_memory_utilization: 0.9
swap_space: 4
optimization:
enable_prefix_caching: true
enable_chunked_prefill: true
max_num_batched_tokens: 4096
logging:
level: "INFO"
format: "json"
monitoring:
enable_metrics: true
metrics_port: 8001
Exposing the service
# llm-service.yaml
apiVersion: v1
kind: Service
metadata:
name: llm-service
namespace: production
labels:
app: llm-service
spec:
ports:
- port: 80
targetPort: 8000
name: http
- port: 9090
targetPort: 8001
name: metrics
clusterIP: None
selector:
app: llm-service
---
apiVersion: v1
kind: Service
metadata:
name: llm-service-loadbalancer
namespace: production
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 8000
name: http
selector:
app: llm-service
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 10800
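Because the first Service is headless (clusterIP: None), each StatefulSet replica gets a stable DNS name such as llm-service-0.llm-service.production.svc.cluster.local, which is what in-cluster callers or a custom router would target. A minimal sketch; the /v1/generate request shape is an assumption about the application's API, not something defined by Kubernetes.
# call_llm_pod.py
# Sketch: call one replica directly via the headless service's per-pod DNS name.
import requests

pod_dns = "llm-service-0.llm-service.production.svc.cluster.local"
resp = requests.post(
    f"http://{pod_dns}:8000/v1/generate",  # headless service: connect to the pod port directly
    json={"prompt": "Summarize the clause below:", "max_tokens": 256},  # assumed schema
    timeout=120,
)
resp.raise_for_status()
print(resp.json())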
2.2 API Gateway Deployment
Ingress controller configuration
# nginx-ingress-controller.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-ingress-controller
namespace: ingress-nginx
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/component: controller
template:
metadata:
labels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/component: controller
spec:
containers:
- name: controller
image: registry.k8s.io/ingress-nginx/controller:v1.8.1
args:
- /nginx-ingress-controller
- --configmap=$(POD_NAMESPACE)/nginx-configuration
- --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
- --udp-services-configmap=$(POD_NAMESPACE)/udp-services
- --annotations-prefix=nginx.ingress.kubernetes.io
- --enable-metrics=true
- --metrics-port=10254
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
- name: https
containerPort: 443
- name: metrics
containerPort: 10254
resources:
requests:
cpu: 100m
memory: 90Mi
limits:
cpu: 500m
memory: 512Mi
API routing configuration
# api-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: llm-api-ingress
namespace: production
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rpm: "100"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
ingressClassName: nginx
tls:
- hosts:
- api.llm-production.com
secretName: llm-api-tls
rules:
- host: api.llm-production.com
http:
paths:
- path: /v1/generate
pathType: Prefix
backend:
service:
name: llm-service-loadbalancer
port:
number: 80
- path: /v1/health
pathType: Prefix
backend:
service:
name: llm-service-loadbalancer
port:
number: 80
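External clients reach the same API over TLS through the Ingress host. A sketch assuming a bearer-token auth scheme (the header is an assumption; this article does not define the gateway's auth mechanism):
# call_llm_api.py
# Sketch: call the public endpoint exposed by the Ingress above.
import requests

resp = requests.post(
    "https://api.llm-production.com/v1/generate",
    headers={"Authorization": "Bearer <your-api-key>"},  # auth scheme is an assumption
    json={"prompt": "Draft a risk summary for this application.", "max_tokens": 512},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())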
2.3 Load Balancing and Autoscaling
Horizontal Pod Autoscaler
# hpa-config.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-service-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: llm-service
minReplicas: 3
maxReplicas: 12
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "50"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Max
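The two Resource metrics work with metrics-server alone, but the Pods metric http_requests_per_second only resolves if a custom-metrics adapter (typically prometheus-adapter) exposes it to the HPA. A quick way to sanity-check the underlying signal is to query Prometheus directly; the sketch assumes the in-cluster Prometheus address from the monitoring section and the llm_requests_total counter exported by the custom metrics collector described there.
# check_hpa_metric.py
# Sketch: verify the request-rate signal behind the HPA's Pods metric.
import requests

PROM = "http://prometheus.monitoring.svc.cluster.local:9090"  # assumed Service address
query = 'sum(rate(llm_requests_total{namespace="production"}[2m])) by (pod)'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("pod", "<all>"), "req/s =", series["value"][1])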
Vertical Pod Autoscaler
# vpa-config.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: llm-service-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: StatefulSet
name: llm-service
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: llm-service
maxAllowed:
cpu: 32
memory: 128Gi
minAllowed:
cpu: 4
memory: 16Gi
controlledResources: ["cpu", "memory"]
Storage and Data Management
3.1 Model Storage
Distributed storage configuration
# distributed-storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: llm-model-pv
spec:
capacity:
storage: 200Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: distributed-fast
nfs:
server: nfs-server.production.svc.cluster.local
path: "/models"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: llm-model-pvc
namespace: production
spec:
accessModes:
- ReadWriteMany
storageClassName: distributed-fast
resources:
requests:
storage: 200Gi
Model version management
# model_version_manager.py
import os
import json
import hashlib
from datetime import datetime

import boto3


class ModelVersionManager:
    """Model version manager."""

    def __init__(self, storage_backend="s3", bucket_name="llm-models"):
        self.storage_backend = storage_backend
        self.bucket_name = bucket_name
        if storage_backend == "s3":
            self.s3_client = boto3.client('s3')

    def upload_model(self, model_path, model_name, version):
        """Upload a model artifact together with its version metadata."""
        # Compute the model file hash
        model_hash = self.calculate_file_hash(model_path)
        # Build the S3 key
        s3_key = f"models/{model_name}/v{version}/model.bin"
        # Upload the model file
        self.s3_client.upload_file(model_path, self.bucket_name, s3_key)
        # Build the version metadata
        version_metadata = {
            "model_name": model_name,
            "version": version,
            "hash": model_hash,
            "upload_time": datetime.now().isoformat(),
            "file_size": os.path.getsize(model_path),
            "metadata": self.extract_model_metadata(model_path)
        }
        # Persist the metadata next to the model
        metadata_key = f"models/{model_name}/v{version}/metadata.json"
        self.s3_client.put_object(
            Bucket=self.bucket_name,
            Key=metadata_key,
            Body=json.dumps(version_metadata)
        )
        return version_metadata

    def promote_model(self, model_name, version, environment):
        """Promote a model version to the given environment."""
        # Fetch the version metadata
        metadata = self.get_model_metadata(model_name, version)
        # Build the promotion record
        promotion_record = {
            "model_name": model_name,
            "version": version,
            "environment": environment,
            "promotion_time": datetime.now().isoformat(),
            "promoted_by": os.getenv("USER", "system"),
            "metadata": metadata
        }
        # Persist the promotion record
        promotion_key = f"promotions/{environment}/{model_name}/v{version}.json"
        self.s3_client.put_object(
            Bucket=self.bucket_name,
            Key=promotion_key,
            Body=json.dumps(promotion_record)
        )
        return promotion_record

    def get_model_metadata(self, model_name, version):
        """Fetch the metadata for a model version."""
        metadata_key = f"models/{model_name}/v{version}/metadata.json"
        try:
            response = self.s3_client.get_object(
                Bucket=self.bucket_name,
                Key=metadata_key
            )
            return json.loads(response['Body'].read())
        except self.s3_client.exceptions.NoSuchKey:
            return None

    def calculate_file_hash(self, file_path):
        """Compute the SHA-256 hash of a file."""
        sha256_hash = hashlib.sha256()
        with open(file_path, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()

    def extract_model_metadata(self, model_path):
        """Extract model-specific metadata."""
        # Model-specific extraction logic (framework, quantization, ...) goes here
        return {
            "framework": "pytorch",
            "quantization": "awq",
            "bits": 4
        }
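A short usage sketch (bucket, paths and version numbers are illustrative):
# Usage example
manager = ModelVersionManager(storage_backend="s3", bucket_name="llm-models")
meta = manager.upload_model("/models/llama2-70b-awq/model.bin", "llama2-70b-awq", version="1.2.0")
print(meta["hash"], meta["file_size"])
# After offline evaluation passes, promote the version to production
manager.promote_model("llama2-70b-awq", "1.2.0", environment="production")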
3.2 Configuration Management
ConfigMap management
# config-management.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: llm-config-template
namespace: production
data:
model_config.yaml: |
model:
name: "{{ .Values.model.name }}"
quantization: "{{ .Values.model.quantization }}"
bits: {{ .Values.model.bits }}
serving:
max_batch_size: {{ .Values.serving.maxBatchSize }}
max_num_seqs: {{ .Values.serving.maxNumSeqs }}
tensor_parallel_size: {{ .Values.serving.tensorParallelSize }}
optimization:
enable_prefix_caching: {{ .Values.optimization.enablePrefixCaching }}
enable_chunked_prefill: {{ .Values.optimization.enableChunkedPrefill }}
---
apiVersion: v1
kind: Secret
metadata:
name: llm-secrets
namespace: production
type: Opaque
stringData:
database-url: "postgresql://user:password@db.production.svc.cluster.local:5432/llm"
api-key: "your-api-key-here"
model-registry-token: "your-registry-token"
Configuration hot reloading
# config_hot_reloader.py
import threading

import yaml
from kubernetes import client, config, watch


class ConfigHotReloader:
    """Hot-reloads service configuration from a ConfigMap."""

    def __init__(self, namespace="production"):
        config.load_incluster_config()
        self.v1 = client.CoreV1Api()
        self.namespace = namespace
        self.watch_thread = None
        self.running = False

    def start_watching(self, config_map_name, callback):
        """Start watching the ConfigMap for changes."""
        self.running = True
        self.watch_thread = threading.Thread(
            target=self._watch_config_map,
            args=(config_map_name, callback)
        )
        self.watch_thread.daemon = True
        self.watch_thread.start()

    def stop_watching(self):
        """Stop watching."""
        self.running = False
        if self.watch_thread:
            self.watch_thread.join()

    def _watch_config_map(self, config_map_name, callback):
        """Watch the ConfigMap and invoke the callback on changes."""
        w = watch.Watch()
        for event in w.stream(
            self.v1.list_namespaced_config_map,
            namespace=self.namespace,
            field_selector=f"metadata.name={config_map_name}"
        ):
            if not self.running:
                break
            config_map = event['object']
            event_type = event['type']
            if event_type in ['ADDED', 'MODIFIED']:
                print(f"ConfigMap {config_map_name} {event_type}")
                # Extract the config data and hand it to the callback
                config_data = config_map.data
                try:
                    callback(config_data)
                except Exception as e:
                    print(f"Error processing config change: {e}")
        w.stop()

    def reload_configuration(self, config_data):
        """Reload the configuration."""
        # Parse the YAML payload
        try:
            new_config = yaml.safe_load(config_data.get('config.yaml', ''))
            # Validate before applying
            if self.validate_config(new_config):
                self.apply_config(new_config)
                print("Configuration reloaded successfully")
            else:
                print("Invalid configuration format, keeping current config")
        except yaml.YAMLError as e:
            print(f"Error parsing configuration: {e}")

    def validate_config(self, config):
        """Validate the configuration structure."""
        if not isinstance(config, dict):
            return False
        required_keys = ['model', 'serving', 'optimization']
        for key in required_keys:
            if key not in config:
                return False
        return True

    def apply_config(self, config):
        """Apply the new configuration."""
        # Concrete apply logic goes here, e.g. updating serving parameters
        # or restarting components.
        print(f"Applying new configuration: {config}")
        # Notify other components that the configuration changed
        self.notify_config_update(config)

    def notify_config_update(self, config):
        """Notify other components about the configuration update."""
        # Could publish to a message queue, WebSocket, etc.
        pass


# Usage example
def handle_config_change(config_data):
    """Handle a configuration change."""
    print("Configuration changed, reloading...")
    # Concrete reload logic goes here


reloader = ConfigHotReloader()
reloader.start_watching("llm-service-config", handle_config_change)
Monitoring and Observability
4.1 Comprehensive Monitoring
Prometheus + Grafana monitoring stack
# monitoring-stack.yaml
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus:v2.45.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus/'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-storage
mountPath: /prometheus
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 2
memory: 4Gi
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
- name: prometheus-storage
persistentVolumeClaim:
claimName: prometheus-pvc
---
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'llm-service'
kubernetes_sd_configs:
- role: pod
namespaces:
- production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: llm-service
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
metrics_path: /metrics
scrape_interval: 10s
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
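With this scrape configuration, the per-pod metrics exported by the LLM service land in Prometheus under the llm-service job, so dashboards and runbooks can read them over the HTTP API. A sketch, assuming a prometheus Service in the monitoring namespace and the metric names defined by the collector below:
# query_llm_metrics.py
# Sketch: pull a few service-level numbers from the Prometheus HTTP API.
import requests

PROM = "http://prometheus.monitoring.svc.cluster.local:9090"  # assumed Service address

def prom_query(expr):
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=10)
    r.raise_for_status()
    return r.json()["data"]["result"]

# p95 end-to-end latency and generated tokens per second
p95_latency = prom_query(
    'histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le))'
)
tokens_per_sec = prom_query('sum(rate(llm_tokens_generated_total[5m]))')
print("p95 latency:", p95_latency)
print("tokens/sec:", tokens_per_sec)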
Custom monitoring metrics
# custom_metrics.py
import time

import GPUtil
from prometheus_client import Counter, Histogram, Gauge, start_http_server


class LLMMetricsCollector:
    """Custom Prometheus metrics collector for the LLM service."""

    def __init__(self, port=8001):
        # Business metrics
        self.request_count = Counter(
            'llm_requests_total',
            'Total number of LLM requests',
            ['model_name', 'status']
        )
        self.request_duration = Histogram(
            'llm_request_duration_seconds',
            'LLM request duration in seconds',
            ['model_name', 'batch_size']
        )
        self.tokens_generated = Counter(
            'llm_tokens_generated_total',
            'Total number of tokens generated',
            ['model_name']
        )
        # System metrics
        self.gpu_utilization = Gauge(
            'gpu_utilization_percent',
            'GPU utilization percentage',
            ['gpu_index', 'node_name']
        )
        self.gpu_memory_used = Gauge(
            'gpu_memory_used_bytes',
            'GPU memory used in bytes',
            ['gpu_index', 'node_name']
        )
        self.model_load_time = Histogram(
            'llm_model_load_duration_seconds',
            'Model loading duration in seconds',
            ['model_name']
        )
        # Start the metrics HTTP server (scraped via the "metrics" port)
        start_http_server(port)

    def record_request(self, model_name, duration, batch_size=1, status="success"):
        """Record one request."""
        self.request_count.labels(model_name=model_name, status=status).inc()
        self.request_duration.labels(model_name=model_name, batch_size=batch_size).observe(duration)

    def record_tokens(self, model_name, token_count):
        """Record the number of generated tokens."""
        self.tokens_generated.labels(model_name=model_name).inc(token_count)

    def update_gpu_metrics(self, node_name):
        """Refresh GPU utilization and memory metrics."""
        try:
            gpus = GPUtil.getGPUs()
            for i, gpu in enumerate(gpus):
                self.gpu_utilization.labels(gpu_index=i, node_name=node_name).set(gpu.load * 100)
                self.gpu_memory_used.labels(gpu_index=i, node_name=node_name).set(gpu.memoryUsed * 1024 * 1024)
        except Exception as e:
            print(f"Error updating GPU metrics: {e}")

    def record_model_load(self, model_name, duration):
        """Record model loading time."""
        self.model_load_time.labels(model_name=model_name).observe(duration)


# Usage example
metrics_collector = LLMMetricsCollector()

# In the application
def handle_request(model_name, prompt):
    start_time = time.time()
    try:
        # generate_response is the application's actual inference call
        response = generate_response(model_name, prompt)
        # Record success metrics
        duration = time.time() - start_time
        metrics_collector.record_request(model_name, duration, status="success")
        metrics_collector.record_tokens(model_name, len(response.split()))
        return response
    except Exception:
        # Record failure metrics
        duration = time.time() - start_time
        metrics_collector.record_request(model_name, duration, status="failed")
        raise
4.2 Distributed Tracing
Jaeger integration
# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.47
ports:
- containerPort: 5775
protocol: UDP
- containerPort: 6831
protocol: UDP
- containerPort: 6832
protocol: UDP
- containerPort: 5778
protocol: TCP
- containerPort: 16686
protocol: TCP
- containerPort: 14268
protocol: TCP
- containerPort: 14250
protocol: TCP
- containerPort: 9411
protocol: TCP
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
- name: SPAN_STORAGE_TYPE
value: "elasticsearch"
- name: ES_SERVER_URLS
value: "http://elasticsearch.monitoring.svc.cluster.local:9200"
Application-side tracing integration
# tracing_integration.py
import time

from jaeger_client import Config
from opentracing.ext import tags


def init_tracer(service_name='llm-service'):
    """Initialize the Jaeger tracer."""
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': 'jaeger-agent.monitoring.svc.cluster.local',
                'reporting_port': '6831',
            },
            'logging': True,
        },
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()


tracer = init_tracer()


def traced_function(func):
    """Decorator that wraps a function call in a span."""
    def wrapper(*args, **kwargs):
        with tracer.start_active_span(func.__name__) as scope:
            try:
                result = func(*args, **kwargs)
                scope.span.set_tag(tags.HTTP_STATUS_CODE, 200)
                return result
            except Exception as e:
                scope.span.set_tag(tags.ERROR, True)
                scope.span.log_kv({'event': 'error', 'error.object': e})
                raise
    return wrapper


@traced_function
def generate_text_with_tracing(model_name, prompt, max_tokens=100):
    """Text generation with tracing."""
    with tracer.start_active_span('llm_inference') as scope:
        scope.span.set_tag('model.name', model_name)
        scope.span.set_tag('prompt.length', len(prompt))
        scope.span.set_tag('max_tokens', max_tokens)
        # Record the start time
        start_time = time.time()
        try:
            # Run inference (generate_text is the application's inference call)
            result = generate_text(model_name, prompt, max_tokens)
            # Attach timing and output size to the span
            duration = time.time() - start_time
            scope.span.set_tag('inference.duration', duration)
            scope.span.set_tag('output.length', len(result))
            scope.span.set_tag('tokens.generated', len(result.split()))
            return result
        except Exception as e:
            scope.span.set_tag('error', True)
            scope.span.log_kv({
                'event': 'inference_failed',
                'error.message': str(e),
                'error.type': type(e).__name__
            })
            raise
Security and Compliance
5.1 Network Security
NetworkPolicy configuration
# network-security.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: llm-production-network-policy
namespace: production
spec:
podSelector:
matchLabels:
environment: production
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: production
- namespaceSelector:
matchLabels:
name: ingress-nginx
- namespaceSelector:
matchLabels:
name: monitoring
ports:
- protocol: TCP
port: 8000
- protocol: TCP
port: 9090
egress:
- to:
- namespaceSelector:
matchLabels:
name: production
ports:
- protocol: TCP
port: 5432 # PostgreSQL
- protocol: TCP
port: 6379 # Redis
- protocol: TCP
port: 53 # DNS
- protocol: UDP
port: 53 # DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: TCP
port: 443 # Kubernetes API
- protocol: TCP
port: 6443 # Kubernetes API
TLS configuration
# tls-config.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: admin@yourcompany.com
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
- dns01:
cloudflare:
email: admin@yourcompany.com
apiTokenSecretRef:
name: cloudflare-api-token-secret
key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: llm-api-cert
namespace: production
spec:
secretName: llm-api-tls
issuerRef:
name: letsencrypt-prod
kind: ClusterIssuer
dnsNames:
- api.llm-production.com
- llm-api.yourcompany.com
duration: 2160h # 90 days
renewBefore: 720h # 30 days
5.2 RBAC Access Control
Role definitions
# rbac-config.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: llm-service-reader
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "statefulsets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["metrics.k8s.io"]
resources: ["pods", "nodes"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: llm-service-operator
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps", "secrets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments", "statefulsets", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["autoscaling"]
resources: ["horizontalpodautoscalers", "verticalpodautoscalers"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses", "networkpolicies"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: llm-service-operator-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: llm-service-operator
subjects:
- kind: ServiceAccount
name: llm-service-operator
namespace: production
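Once the binding is applied, it is worth verifying the effective permissions from the ServiceAccount's own point of view rather than trusting the manifest. A sketch using SelfSubjectAccessReview, intended to run with the llm-service-operator ServiceAccount's credentials (for example from a pod that uses it):
# check_rbac.py
# Sketch: verify what the current ServiceAccount is actually allowed to do.
from kubernetes import client, config

config.load_incluster_config()
authz = client.AuthorizationV1Api()

def can_i(verb, resource, group="", namespace="production"):
    review = client.V1SelfSubjectAccessReview(
        spec=client.V1SelfSubjectAccessReviewSpec(
            resource_attributes=client.V1ResourceAttributes(
                verb=verb, resource=resource, group=group, namespace=namespace
            )
        )
    )
    return authz.create_self_subject_access_review(review).status.allowed

print("patch statefulsets:", can_i("patch", "statefulsets", group="apps"))
print("delete secrets:", can_i("delete", "secrets"))
print("create nodes:", can_i("create", "nodes", namespace=""))  # expected: False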
ServiceAccount configuration
# service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: llm-service
namespace: production
automountServiceAccountToken: false
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: llm-service-operator
namespace: production
---
apiVersion: v1
kind: Secret
metadata:
name: llm-service-secret
namespace: production
annotations:
kubernetes.io/service-account.name: llm-service
type: kubernetes.io/service-account-token
Deployment Automation and GitOps
6.1 ArgoCD Configuration
ArgoCD Application definition
# argocd-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: llm-production
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: production
source:
repoURL: https://github.com/yourcompany/llm-k8s-configs
targetRevision: HEAD
path: environments/production
helm:
valueFiles:
- values-production.yaml
parameters:
- name: image.tag
value: $ARGOCD_APP_REVISION
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
- PruneLast=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
revisionHistoryLimit: 10
ApplicationSet management
# argocd-app-set.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: llm-multi-region
namespace: argocd
spec:
generators:
- list:
elements:
- cluster: production-us-east-1
region: us-east-1
replicas: 6
- cluster: production-us-west-2
region: us-west-2
replicas: 4
- cluster: production-eu-west-1
region: eu-west-1
replicas: 4
template:
metadata:
name: 'llm-{{region}}'
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: production
source:
repoURL: https://github.com/yourcompany/llm-k8s-configs
targetRevision: HEAD
path: environments/production
helm:
valueFiles:
- values-production.yaml
parameters:
- name: region
value: '{{region}}'
- name: replicas
value: '{{replicas}}'
destination:
server: '{{cluster}}'
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
6.2 Progressive Delivery
Flagger canary configuration
# canary-deployment.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: llm-service
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-service
progressDeadlineSeconds: 600
service:
port: 80
targetPort: 8000
gateways:
- public-gateway.istio-system.svc.cluster.local
hosts:
- api.llm-production.com
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 30s
webhooks:
- name: load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://api.llm-production.com/v1/generate"
- name: rollback-hook
url: http://notification-service.production.svc.cluster.local/rollback
timeout: 5s
metadata:
message: "Canary deployment failed, rolling back"
skipAnalysis: false
Blue-green deployment implementation
# blue_green_deploy.py
import copy
import time

from kubernetes import client, config


class BlueGreenDeployer:
    """Blue-green deployment manager."""

    def __init__(self, namespace="production"):
        config.load_incluster_config()
        self.apps_v1 = client.AppsV1Api()
        self.core_v1 = client.CoreV1Api()
        self.namespace = namespace

    def deploy_blue_green(self, deployment_name, new_image, timeout=600):
        """Run a blue-green rollout."""
        # Fetch the current (blue) deployment
        blue_deployment = self.get_deployment(f"{deployment_name}-blue")
        if not blue_deployment:
            raise Exception("Blue deployment not found")
        # Create the green environment with the new image
        self.create_green_deployment(deployment_name, blue_deployment, new_image)
        # Wait until the green environment is ready
        self.wait_for_deployment_ready(f"{deployment_name}-green", timeout)
        # Health check
        if not self.health_check(f"{deployment_name}-green"):
            self.rollback_to_blue(deployment_name)
            raise Exception("Health check failed for green deployment")
        # Switch traffic to the green environment
        self.switch_traffic_to_green(deployment_name)
        # Observation window before tearing anything down
        time.sleep(60)
        # All good: remove blue and promote green to be the new blue for the next rollout
        self.delete_deployment(f"{deployment_name}-blue")
        self.rename_deployment(f"{deployment_name}-green", f"{deployment_name}-blue")
        return True

    def get_deployment(self, name):
        """Read a deployment, returning None if it does not exist."""
        try:
            return self.apps_v1.read_namespaced_deployment(name=name, namespace=self.namespace)
        except client.ApiException as e:
            if e.status == 404:
                return None
            raise

    def create_green_deployment(self, base_name, blue_deployment, new_image):
        """Create the green deployment from the blue spec with the new image."""
        spec = copy.deepcopy(blue_deployment.spec)
        # Make sure the green pods are labeled and selected as "green"
        spec.selector.match_labels = {"version": "green", "app": base_name}
        spec.template.metadata.labels = {"version": "green", "app": base_name}
        green_deployment = client.V1Deployment(
            metadata=client.V1ObjectMeta(
                name=f"{base_name}-green",
                namespace=self.namespace,
                labels={"version": "green", "app": base_name}
            ),
            spec=spec
        )
        # Swap in the new image
        for container in green_deployment.spec.template.spec.containers:
            if container.name == base_name:
                container.image = new_image
        return self.apps_v1.create_namespaced_deployment(
            namespace=self.namespace,
            body=green_deployment
        )

    def wait_for_deployment_ready(self, name, timeout):
        """Block until all replicas of the deployment are ready."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            dep = self.apps_v1.read_namespaced_deployment(name=name, namespace=self.namespace)
            desired = dep.spec.replicas or 0
            ready = dep.status.ready_replicas or 0
            if desired and ready == desired:
                return
            time.sleep(10)
        raise Exception(f"Deployment {name} not ready within {timeout}s")

    def health_check(self, deployment_name):
        """Health check for the new environment."""
        # Check pod status, probe the service endpoints, run smoke tests, etc.
        return True

    def switch_traffic_to_green(self, base_name):
        """Point the service at the green environment."""
        # Update the Service selector
        service = self.core_v1.read_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace
        )
        service.spec.selector = {"version": "green", "app": base_name}
        self.core_v1.patch_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace,
            body=service
        )

    def rollback_to_blue(self, base_name):
        """Roll traffic back to the blue environment."""
        service = self.core_v1.read_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace
        )
        service.spec.selector = {"version": "blue", "app": base_name}
        self.core_v1.patch_namespaced_service(
            name=f"{base_name}-service",
            namespace=self.namespace,
            body=service
        )
        # Remove the failed green environment
        self.delete_deployment(f"{base_name}-green")

    def delete_deployment(self, name):
        """Delete a deployment, ignoring 'not found' errors."""
        try:
            self.apps_v1.delete_namespaced_deployment(name=name, namespace=self.namespace)
        except client.ApiException as e:
            if e.status != 404:
                raise

    def rename_deployment(self, old_name, new_name):
        """'Rename' a deployment by recreating it under the new name."""
        # A full implementation would also flip the version labels and the
        # Service selector back to "blue" on the next rollout cycle.
        dep = self.apps_v1.read_namespaced_deployment(name=old_name, namespace=self.namespace)
        dep.metadata = client.V1ObjectMeta(
            name=new_name,
            namespace=self.namespace,
            labels=dep.metadata.labels
        )
        self.apps_v1.create_namespaced_deployment(namespace=self.namespace, body=dep)
        self.delete_deployment(old_name)
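A short usage sketch (the image tag is illustrative); it assumes the blue/green Deployments and the <name>-service Service follow the naming convention used by the class:
# Usage example
deployer = BlueGreenDeployer(namespace="production")
deployer.deploy_blue_green(
    deployment_name="llm-service",
    new_image="your-registry/llm-service:v1.1.0",
    timeout=900,
)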
Summary
A production Kubernetes deployment has to balance high availability, performance, security and maintainability. With a sound architecture, a complete monitoring stack, strict security controls and automated delivery pipelines, you can build a stable, efficient and scalable production environment for large models.
Key success factors:
- Sound resource planning: size GPU and CPU capacity against actual business load
- A complete monitoring stack: track performance, resource usage and business metrics in real time
- Strict security controls: network isolation, RBAC and data encryption
- Automated operations: CI/CD, autoscaling and automatic failure recovery
- Progressive delivery: reduce rollout risk and protect business continuity
Following these practices keeps a domain-specific large model running stably and efficiently in production.