Kustomize includeSelectors Pitfall: Debugging Cross-Instance Routing Between Multiple LiteLLM Instances

Kiyor
February 26, 2026, 12:47

Background

We run an LLM proxy gateway built on LiteLLM. Using Kustomize's base + instances layout, we deploy multiple instances into the same Kubernetes namespace, all sharing one PostgreSQL database for unified API key management and usage tracking:

Instance   Domain                   Upstream           Provider
kelly      llm.goodvision.tech      kellycloudai.com   openai/*
yxaiapp    ai.goodvision.tech       yxaiapp.com        openai/*
claude     claude.goodvision.tech   Anthropic API      anthropic/*
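
The directory layout, as implied by the apply commands later in this post (the file names under base/ are a reconstruction, not copied from the repo):

deploy/
├── base/                            # shared Deployment / Service / ConfigMap
│   └── kustomization.yaml
└── instances/
    ├── kelly/kustomization.yaml     # per-instance namePrefix + instance label
    ├── claude/kustomization.yaml
    └── yxaiapp/kustomization.yaml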

Each instance uses model_name: "*" as a wildcard route, transparently forwarding any requested model name to its own upstream. Kustomize distinguishes the instances by prefixing resource names via namePrefix and tagging them with an instance label via labels.
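
For concreteness, the claude instance's proxy config is shaped roughly like this (the model_list snippet mirrors LiteLLM's wildcard-routing syntax; the rest of the file is omitted):

# litellm config.yaml for the claude instance (sketch)
model_list:
  - model_name: "*"            # accept any model name the client requests
    litellm_params:
      model: "anthropic/*"     # forward to Anthropic, preserving the model name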

@startuml
!theme plain
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

cloud "Clients" {
  [JMS Test Client] as client
}

package "K8s Namespace: litellm" {
  [Ingress\nclaude.goodvision.tech] as ing
  [Service: claude-litellm] as svc
  [Pod: claude-litellm] as pod_claude
  [Pod: litellm (kelly)] as pod_kelly
  [Pod: yxaiapp-litellm] as pod_yxaiapp
  database "PostgreSQL" as db
}

cloud "Upstream" {
  [Anthropic API] as anthropic
  [kellycloudai.com] as kelly_up
  [yxaiapp.com] as yxaiapp_up
}

client --> ing
ing --> svc
svc --> pod_claude : expected
svc ..> pod_kelly : <color:red>unexpected!</color>
svc ..> pod_yxaiapp : <color:red>unexpected!</color>
pod_claude --> anthropic
pod_kelly --> kelly_up
pod_yxaiapp --> yxaiapp_up
pod_claude --> db
pod_kelly --> db
pod_yxaiapp --> db
@enduml

Symptoms

Testing claude.goodvision.tech through JMS, the Spend Logs showed a strange alternating pattern:

08:29:12 PM  openai/claude-opus-4-6      15633 tokens  cost: -
08:29:03 PM  anthropic/claude-opus-4-6   14333 tokens  cost: $0.076
08:28:54 PM  openai/claude-opus-4-6        168 tokens  cost: -
08:28:46 PM  openai/claude-opus-4-6      13498 tokens  cost: -
08:28:40 PM  anthropic/claude-opus-4-6   12355 tokens  cost: $0.065
08:28:29 PM  openai/claude-opus-4-6        130 tokens  cost: -
08:28:21 PM  openai/claude-opus-4-6       9243 tokens  cost: -
08:28:15 PM  anthropic/claude-opus-4-6    2527 tokens  cost: $0.016

The claude instance is configured with anthropic/*, yet requests matched anthropic/claude-opus-4-6 one moment and openai/claude-opus-4-6 the next. The backends had effectively become a round-robin pool, with several of them getting hit.

Investigation

1. Ruling out database cross-contamination

All three instances share the PostgreSQL database, so the first suspect was model configs from another instance cached in the database:

SELECT * FROM "LiteLLM_ProxyModelTable";
-- (0 rows)

SELECT * FROM "LiteLLM_Config";
-- (0 rows)

No extra model configs in the database. Ruled out.

2. Checking the Deployments and Services

$ kubectl -n litellm get deployments -o wide
NAME              READY   SELECTOR
claude-litellm    1/1     app=litellm    # <-- all identical!
litellm           1/1     app=litellm
yxaiapp-litellm   1/1     app=litellm

All three Deployments use the exact same selector: app=litellm.

3. The smoking gun: Endpoints

$ kubectl -n litellm get endpoints
NAME              ENDPOINTS
claude-litellm    172.16.0.153:4000,172.16.0.236:4000,172.16.0.238:4000
litellm           172.16.0.153:4000,172.16.0.236:4000,172.16.0.238:4000
yxaiapp-litellm   172.16.0.153:4000,172.16.0.236:4000,172.16.0.238:4000

Every Service has three endpoints! The claude-litellm Service was distributing traffic round-robin across all three pods.
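
The Service object confirms it directly (output reconstructed from the selectors shown above):

$ kubectl -n litellm get svc claude-litellm -o jsonpath='{.spec.selector}'
{"app":"litellm"}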

Root Cause

The problem lies in the Kustomize labels configuration:

# kustomization.yaml (all instances)
labels:
  - pairs:
      instance: claude
    includeSelectors: false  # <-- the culprit

includeSelectors: false means the instance label is added only to each resource's metadata.labels. It is NOT added to:

  • Service 的 spec.selector
  • Deployment 的 spec.selector.matchLabels
  • Pod template 的 metadata.labels

So the resources actually generated look like this:

# Service: claude-litellm
metadata:
  labels:
    instance: claude  # ✅ present in metadata
spec:
  selector:
    app: litellm      # ❌ no instance in the selector!

# Pod: claude-litellm-xxx
metadata:
  labels:
    app: litellm      # this is all there is
    # instance: claude is NOT here

The Service selects on app=litellm alone, so it naturally matches every pod in the namespace that carries app=litellm.
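
Running the same label query the Service uses shows all three pods (name suffixes abbreviated, as in the verification output below):

$ kubectl -n litellm get pods -l app=litellm -o name
pod/claude-litellm-xxx
pod/litellm-xxx
pod/yxaiapp-litellm-xxx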

@startuml
!theme plain

state "includeSelectors: false (Bug)" as bug {
  state "Service selector" as s1 : app=litellm
  state "Pod A labels" as p1 : app=litellm
  state "Pod B labels" as p2 : app=litellm
  state "Pod C labels" as p3 : app=litellm
  s1 --> p1 : match ✅
  s1 --> p2 : match ✅
  s1 --> p3 : match ✅
}

state "includeSelectors: true (Fix)" as fix {
  state "Service selector" as s2 : app=litellm\ninstance=claude
  state "Pod A labels" as p4 : app=litellm\ninstance=claude
  state "Pod B labels" as p5 : app=litellm\ninstance=kelly
  state "Pod C labels" as p6 : app=litellm\ninstance=yxaiapp
  s2 --> p4 : match ✅
  s2 -[#red]-> p5 : no match ❌
  s2 -[#red]-> p6 : no match ❌
}

@enduml

Fix

1. Update kustomization.yaml

Change includeSelectors to true in every instance:

labels:
  - pairs:
      instance: claude
    includeSelectors: true  # applied to selectors and pod template labels too
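
It is worth verifying the rendered selector offline before applying (the overlay path matches the apply commands below; the Service match is shown, and the Deployment's matchLabels gains the same pair):

$ kubectl kustomize deploy/instances/claude | grep -A2 'selector:'
  selector:
    app: litellm
    instance: claude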

2. Recreate the Deployments

A Deployment's spec.selector.matchLabels is immutable, so it cannot be updated in place; the old Deployments have to be deleted and recreated:

# Delete the old deployments
kubectl -n litellm delete deployment litellm claude-litellm yxaiapp-litellm

# Re-apply each instance
kubectl apply -k deploy/instances/kelly
kubectl apply -k deploy/instances/claude
kubectl apply -k deploy/instances/yxaiapp
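
Skipping the delete step and applying directly fails with the apiserver's immutability error, which looks roughly like this (paraphrased from memory, not captured during this incident):

The Deployment "claude-litellm" is invalid: spec.selector: Invalid value:
v1.LabelSelector{...}: field is immutable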

3. Verify

$ kubectl -n litellm get pods --show-labels
NAME                          LABELS
claude-litellm-xxx            app=litellm,instance=claude
litellm-xxx                   app=litellm,instance=kelly
yxaiapp-litellm-xxx           app=litellm,instance=yxaiapp

$ kubectl -n litellm get endpoints
NAME              ENDPOINTS
claude-litellm    172.16.0.155:4000   # ✅ exactly 1
litellm           172.16.0.154:4000   # ✅ exactly 1
yxaiapp-litellm   172.16.0.240:4000   # ✅ exactly 1

Each Service now routes only to its own Pod.

Lessons Learned

The default behavior of includeSelectors in Kustomize's labels

includeSelectors: false is the default for Kustomize's labels transformer. Its design intent is "add labels without touching selection logic", which fits when you:

  • want metadata labels on resources for classification and ad-hoc querying
  • do not want to affect Service → Pod routing

But when you deploy multiple instances of the same application into a single namespace, you must set includeSelectors: true; otherwise all the Services share one selector and traffic round-robins across instances.

A quick check

If you suspect a similar problem, look at the endpoint counts first:

kubectl get endpoints -n <namespace>

Each Service's endpoint count should equal the replicas of its corresponding Deployment. If a Service has suspiciously many endpoints, its selector is almost certainly too broad.
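
A one-liner sketch for eyeballing the counts (assumes jq is installed; a Service with no ready addresses prints 0):

kubectl -n litellm get endpoints -o json \
  | jq -r '.items[] | "\(.metadata.name)\t\(.subsets[0].addresses | length)"'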

Why namePrefix is not enough

Kustomize's namePrefix only rewrites resource names (Deployment, Service, ConfigMap, and so on); it never touches label selectors. Distinct resource names do not imply distinct selectors, and the selector is what decides where a Service routes.
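
A minimal illustration with a hypothetical base:

# instances/claude/kustomization.yaml (sketch)
namePrefix: claude-
resources:
  - ../../base

# The base Service "litellm" is renamed "claude-litellm",
# but its spec.selector is emitted exactly as the base wrote it:
#   selector:
#     app: litellm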
