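Kustomize includeSelectors Pitfall: Debugging Cross-Instance Routing in a Multi-Instance LiteLLM Deployment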

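Background

We built an LLM proxy gateway on LiteLLM and deploy several instances of it into a single Kubernetes namespace using a Kustomize base + instances layout. All instances share one PostgreSQL database for unified API key management and usage tracking:

Instance	Domain	Upstream	Provider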
kelly	llm.goodvision.tech	kellycloudai.com	`openai/*`
yxaiapp	ai.goodvision.tech	yxaiapp.com	`openai/*`
claude	claude.goodvision.tech	Anthropic API	`anthropic/*`

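Each instance uses model_name: "*" as a wildcard route and transparently forwards any model name to its own upstream. Kustomize's namePrefix keeps the resource names distinct, and its labels field adds an instance label.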

@startuml
!theme plain
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

cloud "Clients" {
  [JMS Test Client] as client
}

package "K8s Namespace: litellm" {
  [Ingress\nclaude.goodvision.tech] as ing
  [Service: claude-litellm] as svc
  [Pod: claude-litellm] as pod_claude
  [Pod: litellm (kelly)] as pod_kelly
  [Pod: yxaiapp-litellm] as pod_yxaiapp
  database "PostgreSQL" as db
}

cloud "Upstream" {
  [Anthropic API] as anthropic
  [kellycloudai.com] as kelly_up
  [yxaiapp.com] as yxaiapp_up
}

client --> ing
ing --> svc
svc --> pod_claude : expected
svc ..> pod_kelly : <color:red>unexpected!</color>
svc ..> pod_yxaiapp : <color:red>unexpected!</color>
pod_claude --> anthropic
pod_kelly --> kelly_up
pod_yxaiapp --> yxaiapp_up
pod_claude --> db
pod_kelly --> db
pod_yxaiapp --> db
@enduml

Symptoms

While testing claude.goodvision.tech through JMS, the Spend Logs showed a strange alternating pattern:

08:29:12 PM  openai/claude-opus-4-6     15633 tokens  cost: -
08:29:03 PM  anthropic/claude-opus-4-6   14333 tokens  cost: $0.076
08:28:54 PM  openai/claude-opus-4-6      168 tokens    cost: -
08:28:46 PM  openai/claude-opus-4-6      13498 tokens  cost: -
08:28:40 PM  anthropic/claude-opus-4-6   12355 tokens  cost: $0.065
08:28:29 PM  openai/claude-opus-4-6      130 tokens    cost: -
08:28:21 PM  openai/claude-opus-4-6      9243 tokens   cost: -
08:28:15 PM  anthropic/claude-opus-4-6   2527 tokens   cost: $0.016

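The claude instance is configured with anthropic/*, yet requests were logged sometimes as anthropic/claude-opus-4-6 and sometimes as openai/claude-opus-4-6. Requests to a single domain were being spread, effectively round-robin, across several backends.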
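One way to quantify the alternation is to tally the log lines by provider prefix. The sketch below assumes the Spend Logs have been copied into a plain-text file in the format shown above; the function name `count_by_provider` and the file name are made up for illustration, and this is not a LiteLLM feature.

```shell
# Tally Spend Log lines by provider prefix (the part of the model name before "/").
# Assumes each line looks like:
#   08:29:12 PM  openai/claude-opus-4-6  15633 tokens  cost: -
count_by_provider() {
  awk 'NF >= 3 {
    split($3, parts, "/")        # $3 is the model, e.g. "openai/claude-opus-4-6"
    count[parts[1]]++
  }
  END {
    for (p in count) print p, count[p]
  }' | sort
}

# Usage (hypothetical file name):
#   count_by_provider < spend-logs.txt
```

For the eight lines above this gives 3 anthropic and 5 openai entries. That is roughly the 1:2 split one would expect if requests were spread evenly over three pods, two of which use openai/*.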

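Investigation

1. Rule out database cross-contamination

All three instances share PostgreSQL, so the first suspect was model configuration from another instance being stored in the database: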

SELECT * FROM "LiteLLM_ProxyModelTable";
-- (0 rows)

SELECT * FROM "LiteLLM_Config";
-- (0 rows)

The database holds no extra model configuration, so this is ruled out.

2. Check the Deployments and Services

$ kubectl -n litellm get deployments -o wide
NAME              READY   SELECTOR
claude-litellm    1/1     app=litellm    # <-- all identical!
litellm           1/1     app=litellm
yxaiapp-litellm   1/1     app=litellm

All three Deployments use the same selector: app=litellm.
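With more than a handful of resources, duplicated selectors are easy to miss by eye. A small awk filter over the `-o wide` output can flag them. This is only a sketch: `find_shared_selectors` is a hypothetical helper, and it relies on SELECTOR being the last column, which holds for both Deployments and Services in `-o wide` output.

```shell
# Print every selector that is used by more than one resource.
# Reads `kubectl get ... -o wide` table output on stdin; SELECTOR is the last column.
find_shared_selectors() {
  awk 'NR > 1 && NF > 1 {
    names[$NF] = (names[$NF] == "" ? $1 : names[$NF] "," $1)
    n[$NF]++
  }
  END {
    for (s in n) if (n[s] > 1) print s " -> " names[s]
  }'
}

# Usage:
#   kubectl -n litellm get deployments -o wide | find_shared_selectors
#   kubectl -n litellm get services -o wide | find_shared_selectors
```

Services without a selector show up as `<none>` and would be grouped together, so a hit is a hint to look closer rather than proof of a bug.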

3. Key evidence: Endpoints

$ kubectl -n litellm get endpoints
NAME              ENDPOINTS
claude-litellm    172.16.0.153:4000,172.16.0.236:4000,172.16.0.238:4000
litellm           172.16.0.153:4000,172.16.0.236:4000,172.16.0.238:4000
yxaiapp-litellm   172.16.0.153:4000,172.16.0.236:4000,172.16.0.238:4000

Every Service has three endpoints, and all three Services list the same three pod addresses. The claude-litellm Service was load-balancing traffic across all three pods.
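The same check can be scripted by looking for pod addresses that appear behind more than one Service. The helper below is an illustrative sketch (`find_shared_endpoints` is not a kubectl feature) that parses the default table output.

```shell
# List pod addresses that sit behind more than one Service.
# Reads `kubectl get endpoints` table output on stdin (columns: NAME ENDPOINTS [AGE]).
find_shared_endpoints() {
  awk 'NR > 1 && NF >= 2 && $2 != "<none>" {
    count = split($2, addrs, ",")
    for (i = 1; i <= count; i++) {
      owners[addrs[i]] = (owners[addrs[i]] == "" ? $1 : owners[addrs[i]] "," $1)
      seen[addrs[i]]++
    }
  }
  END {
    for (a in seen) if (seen[a] > 1) print a " <- " owners[a]
  }' | sort
}

# Usage:
#   kubectl -n litellm get endpoints | find_shared_endpoints
```

Two caveats: kubectl abbreviates long endpoint lists in table output ("+ N more..."), so this only sees the first few addresses per Service, and some setups point several Services at the same pods on purpose.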

Root cause

The problem is in the Kustomize labels configuration:

# kustomization.yaml (same pattern in every instance; claude shown)
labels:
  - pairs:
      instance: claude
    includeSelectors: false  # <-- the culprit

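includeSelectors: false means the instance label is added only to each resource's own metadata.labels. It is not added to:

The Service's spec.selector
The Deployment's spec.selector.matchLabels
The pod template's metadata.labels (includeTemplates, which would cover this, also defaults to false)

So the resources that actually get generated look like this: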

# Service: claude-litellm
metadata:
  labels:
    instance: claude  # ✅ present in metadata
spec:
  selector:
    app: litellm      # ❌ no instance label in the selector!

# Pod: claude-litellm-xxx
metadata:
  labels:
    app: litellm      # the only label
    # instance: claude is missing here

The Service selects on app=litellm alone, so it matches every pod in the namespace that carries app=litellm.
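The rendered selectors can be inspected without touching the cluster by building the overlay locally and filtering the output. The helper name `show_selectors` is made up for this sketch; the overlay path is the one used by the apply commands in this post.

```shell
# Print every selector block from rendered manifests read on stdin.
# Each "selector:" or "matchLabels:" line is shown with the two lines after it.
show_selectors() {
  grep -A 2 -E '^[[:space:]]*(selector|matchLabels):'
}

# Usage:
#   kubectl kustomize deploy/instances/claude | show_selectors
```

If no instance: line appears under any selector, the overlay has the problem described here.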

@startuml
!theme plain

state "includeSelectors: false (Bug)" as bug {
  state "Service selector" as s1 : app=litellm
  state "Pod A labels" as p1 : app=litellm
  state "Pod B labels" as p2 : app=litellm
  state "Pod C labels" as p3 : app=litellm
  s1 --> p1 : match ✅
  s1 --> p2 : match ✅
  s1 --> p3 : match ✅
}

state "includeSelectors: true (Fix)" as fix {
  state "Service selector" as s2 : app=litellm\ninstance=claude
  state "Pod A labels" as p4 : app=litellm\ninstance=claude
  state "Pod B labels" as p5 : app=litellm\ninstance=kelly
  state "Pod C labels" as p6 : app=litellm\ninstance=yxaiapp
  s2 --> p4 : match ✅
  s2 -[#red]-> p5 : no match ❌
  s2 -[#red]-> p6 : no match ❌
}

@enduml

Fix

1. Update kustomization.yaml

Set includeSelectors to true in every instance:

labels:
  - pairs:
      instance: claude
    includeSelectors: true  # also added to selectors and pod labels

2. Recreate the Deployments

A Deployment's spec.selector.matchLabels is immutable and cannot be updated in place, so each Deployment has to be deleted and created again:

# Delete the old Deployments
kubectl -n litellm delete deployment litellm claude-litellm yxaiapp-litellm

# Apply again
kubectl apply -k deploy/instances/kelly
kubectl apply -k deploy/instances/claude
kubectl apply -k deploy/instances/yxaiapp
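Deleting a Deployment also removes its pods, so each instance is unavailable until its new pod is Ready. A possible variation is to recreate one instance at a time and wait for each rollout before moving on. The sketch below assumes the Deployment-to-overlay mapping shown above; the function name, the `KUBECTL` override and the 120s timeout are illustrative choices, not part of the original procedure.

```shell
# kubectl binary to use; set KUBECTL=echo for a dry run that only prints the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Delete one Deployment, re-apply its overlay, and wait for the rollout to finish.
#   $1 = Deployment name, $2 = overlay directory
recreate_instance() {
  $KUBECTL delete deployment "$1" -n litellm --ignore-not-found &&
  $KUBECTL apply -k "$2" &&
  $KUBECTL rollout status "deployment/$1" -n litellm --timeout=120s
}

# Usage (names and paths as in the commands above):
#   recreate_instance litellm         deploy/instances/kelly
#   recreate_instance claude-litellm  deploy/instances/claude
#   recreate_instance yxaiapp-litellm deploy/instances/yxaiapp
```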

3. Verify

$ kubectl -n litellm get pods --show-labels
NAME                          LABELS
claude-litellm-xxx            app=litellm,instance=claude
litellm-xxx                   app=litellm,instance=kelly
yxaiapp-litellm-xxx           app=litellm,instance=yxaiapp

$ kubectl -n litellm get endpoints
NAME              ENDPOINTS
claude-litellm    172.16.0.155:4000   # ✅ only 1
litellm           172.16.0.154:4000   # ✅ only 1
yxaiapp-litellm   172.16.0.240:4000   # ✅ only 1

Each Service now routes only to its own pod.
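The same verification can be phrased as a label query, which asks the cluster exactly what the fixed Service selector asks. This is a sketch using the instance names from the table at the top; `check_instances` and the `KUBECTL` override are hypothetical helpers.

```shell
# kubectl binary to use; set KUBECTL=echo to print the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"

# For each instance name, list the pods selected by app=litellm plus that instance label.
check_instances() {
  for inst in "$@"; do
    echo "== instance=$inst"
    $KUBECTL get pods -n litellm -l "app=litellm,instance=$inst" -o name
  done
}

# Usage:
#   check_instances claude kelly yxaiapp
```

With one replica per instance, each section should list exactly one pod.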

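Lessons learned

Default behavior of includeSelectors in Kustomize labels

includeSelectors: false is the default for the Kustomize labels transformer. Its intent is "add the label without affecting selection logic", which suits cases where you:

Label resource metadata for classification or querying
Do not want to affect the Service → Pod routing relationship

When several instances of the same application run in one namespace, however, the selectors must be instance-specific, which here means includeSelectors: true. Otherwise every Service shares the same selector and traffic is spread across instances.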

Quick check

If you suspect a similar problem, look at the number of endpoints:

kubectl get endpoints -n <namespace>

Each Service should have as many endpoints as its Deployment has ready replicas. If a Service has noticeably more, its selector is most likely too broad.
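Counting by eye gets tedious with many Services, so the count can be extracted with a short awk filter. This is a sketch under the assumption that the default table output is used; `count_endpoints` is a hypothetical helper name.

```shell
# Print "<service> <number of endpoint addresses>" for each row of
# `kubectl get endpoints` table output read on stdin.
count_endpoints() {
  awk 'NR > 1 && NF >= 2 {
    n = ($2 == "<none>") ? 0 : split($2, addrs, ",")
    print $1, n
  }'
}

# Usage:
#   kubectl get endpoints -n <namespace> | count_endpoints
```

The numbers can then be compared with the READY column of kubectl get deployments. Because kubectl shortens long endpoint lists in table output, a Service with many endpoints will be undercounted by this filter.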

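Why namePrefix is not enough

Kustomize's namePrefix only changes resource names (Deployment, Service, ConfigMap and so on). It does not touch label selectors. Different resource names do not imply different selectors, and it is the selector that decides where a Service routes traffic.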
