ModelRouter
inference.llmkube.dev / v1alpha1
apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: example
apiVersion
string
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind
string
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata
object
spec object required
spec defines the desired state of ModelRouter
backends []object required
Backends are the candidate destinations the router can dispatch to.
Order is not significant; selection is rule-driven. At least one
backend must be declared.
minItems:
1
capabilities
[]string
Capabilities advertised by this backend. Rules can require
capabilities (e.g. ["tools", "vision", "long-context"]) to filter
candidates.
costPerMillionTokens object
CostPerMillionTokens is informational. Used for cost-aware routing
metrics and audit-log enrichment. Values are USD.
completionUSD
string
CompletionUSD is the cost per million completion (output) tokens,
in USD.
pattern:
^[0-9]+(\.[0-9]+)?$
promptUSD
string
PromptUSD is the cost per million prompt (input) tokens, in USD.
pattern:
^[0-9]+(\.[0-9]+)?$
external object
External describes an out-of-cluster provider (Anthropic, OpenAI,
or a LiteLLM proxy). Mutually exclusive with InferenceServiceRef.
credentialsSecretRef object
CredentialsSecretRef points to a Kubernetes Secret containing the
provider credentials. Conventional keys: ANTHROPIC_API_KEY,
OPENAI_API_KEY, LITELLM_MASTER_KEY. The router-proxy reads these as
environment variables.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
model
string required
Model is the upstream model identifier passed through to the
provider (e.g. "claude-opus-4-7", "gpt-5", a LiteLLM model alias).
provider
string required
Provider identifies the upstream API surface. For "litellm", URL must
point at a running LiteLLM proxy speaking OpenAI-compatible API.
For first-party providers, URL is optional (provider defaults apply).
enum:
anthropic, openai, bedrock, vertex_ai, litellm
url
string
URL is the base URL for the provider. Required for "litellm";
optional for first-party providers, which use their published default.
healthCheck object
HealthCheck overrides the default health probe applied to this
backend by the router-proxy.
intervalSeconds
integer
IntervalSeconds is how often the router-proxy probes the backend.
format:
int32
minimum:
1
path
string
Path is the HTTP path probed for health. Defaults to "/health" for
local backends and to the provider's documented health route for
external providers.
timeoutSeconds
integer
TimeoutSeconds is the maximum time a single probe may take.
format:
int32
minimum:
1
inferenceServiceRef object
InferenceServiceRef references an in-cluster InferenceService.
Mutually exclusive with External.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
name
string required
Name is the stable identifier used by rules and observability labels.
Must be lowercase alphanumeric or '-'.
pattern:
^[a-z0-9][a-z0-9-]{0,62}$
tier
string
Tier classifies the backend for rule matching. "local" backends are
served from inside the cluster; "cloud" backends egress the cluster
boundary. Fail-closed rules can only route to local-tier backends.
enum:
local, cloud
timeout
string
Timeout caps how long the proxy waits for this backend to begin
sending response headers. When set it overrides the proxy
default for dispatches that target this backend. Resolution
order at dispatch time: rule.timeout || backend.timeout ||
proxy default (ModelRouter.spec.proxy.responseHeaderTimeout).
Useful when backends in the same router have wildly different
P99 envelopes (in-cluster vLLM vs Anthropic global LB).
weight
integer
Weight is used for the "weighted" routing strategy. Higher values
receive proportionally more traffic. Ignored for other strategies.
Default 1 when unset.
format:
int32
minimum:
0
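The backend fields above can be combined as in the following sketch, which declares one in-cluster backend and one external backend. Every name, price, and timing value here is a hypothetical placeholder, and the example assumes that the referenced InferenceService and Secret live in the same namespace as the ModelRouter.

```yaml
# Illustrative fragment only; names, prices, and durations are assumptions.
spec:
  backends:
    - name: local-qwen                  # must match ^[a-z0-9][a-z0-9-]{0,62}$
      tier: local
      inferenceServiceRef:
        name: qwen3-8b                  # hypothetical InferenceService name
      capabilities: ["tools"]
      weight: 3                         # only used by the "weighted" strategy
      timeout: 30s                      # overrides the proxy default for this backend
      healthCheck:
        path: /health
        intervalSeconds: 10
        timeoutSeconds: 2
    - name: cloud-claude
      tier: cloud
      external:                         # mutually exclusive with inferenceServiceRef
        provider: anthropic
        model: claude-opus-4-7
        credentialsSecretRef:
          name: anthropic-credentials   # hypothetical Secret holding ANTHROPIC_API_KEY
      capabilities: ["tools", "vision", "long-context"]
      costPerMillionTokens:             # strings, not numbers; placeholder prices
        promptUSD: "3.00"
        completionUSD: "15.00"
      weight: 1
```

The cost values are quoted because the schema types them as strings validated by a decimal pattern. The external backend omits url, since the field is optional for first-party providers.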
defaultRoute
string
DefaultRoute names a backend used when no rule matches.
Must reference the Name of an entry in Backends.
endpoint object
Endpoint defines the Kubernetes Service the router-proxy is exposed
through. Mirrors the shape used by InferenceService.
path
string
Path is the HTTP path for the inference endpoint
port
integer
Port is the service port
format:
int32
minimum:
1
maximum:
65535
type
string
Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer)
enum:
ClusterIP, NodePort, LoadBalancer
mcpServer object
MCPServer optionally exposes this router as a Model Context Protocol
endpoint. Inactive until the Phase 3 MCP feature lands; the field is
reserved in the schema for forward compatibility.
enabled
boolean
Enabled toggles MCP exposure. Default false. When true (after Phase
3 lands), the router-proxy serves an MCP endpoint at /mcp using
Streamable HTTP transport and OAuth 2.1.
policy object
Policy holds cross-cutting controls (budgets, classification, audit).
auditLog object
AuditLog controls structured audit emission. Auditing is always on;
this field tunes the destination and verbosity.
filePath
string
FilePath is the destination when Sink=file. Must be writable inside
the router-proxy container. Defaults to "/var/log/mlx-router/audit.log".
includeRequestBody
boolean
IncludeRequestBody, when true, includes the OpenAI request body in
every audit entry. Disabled by default for size and privacy.
sink
string
Sink selects the audit-log destination.
"stdout" (default) emits one JSON object per line to the proxy
container stdout, where it can be collected by the cluster log
stack.
"file" writes to FilePath inside the proxy container.
"otlp" forwards entries to an OTel collector as log records.
enum:
stdout, file, otlp
budgets []object
Budgets caps token and dollar consumption per scope over a rolling
window. Empty list means no budget enforcement.
headerKey
string
HeaderKey is the request header carrying the team identifier when
Scope=team. Defaults to "x-llmkube-team".
maxTokens
integer
MaxTokens caps total tokens (prompt + completion) over the window.
Either MaxTokens or MaxUSD (or both) must be set.
format:
int64
minimum:
1
maxUSD
string
MaxUSD caps total estimated cost in USD over the window. Cost is
computed from RouterBackend.CostPerMillionTokens.
pattern:
^[0-9]+(\.[0-9]+)?$
name
string required
Name identifies this budget for metrics, status, and audit logs.
pattern:
^[a-z0-9][a-z0-9-]{0,62}$
ruleName
string
RuleName is required when Scope=rule. References a RouterRule.Name.
scope
string required
Scope determines what the budget applies to.
"router" caps all traffic through this ModelRouter.
"rule" caps traffic matching a single named rule (see RuleName).
"team" caps traffic identified by a request header (see HeaderKey).
enum:
router, rule, team
windowSeconds
integer
WindowSeconds is the rolling window over which the cap is evaluated.
format:
int32
minimum:
1
classification object
Classification configures how the router determines the data
classification of an inbound request.
headerKey
string
HeaderKey is the request header carrying the classification.
Defaults to "x-llmkube-classification".
mode
string
Mode selects how the router derives a request's classification.
"header-only" (default) trusts the request header
(HeaderKey, defaults to x-llmkube-classification).
"detector" runs the bundled in-proxy detector.
"hybrid" prefers the header, falling back to the detector when no
header is present.
enum:
header-only, detector, hybrid
sensitiveClassifications
[]string
SensitiveClassifications are the classification values that trigger
fail-closed validation: any rule matching one of these values must
have FailClosed=true and reference only local-tier backends.
Defaults to ["pii", "phi"].
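The three policy sub-objects described above (auditLog, budgets, classification) might be configured together as follows. This is a sketch under assumptions: the budget names, caps, and window lengths are invented for illustration, and the rule named by ruleName is hypothetical and would need to exist in spec.rules.

```yaml
# Illustrative fragment only; budget names, caps, and windows are assumptions.
spec:
  policy:
    auditLog:
      sink: file
      filePath: /var/log/mlx-router/audit.log   # the documented default path
      includeRequestBody: false
    budgets:
      - name: router-daily
        scope: router                 # caps all traffic through this ModelRouter
        windowSeconds: 86400
        maxUSD: "50.00"               # string matching ^[0-9]+(\.[0-9]+)?$
      - name: team-hourly
        scope: team                   # keyed by the header named in headerKey
        headerKey: x-llmkube-team
        windowSeconds: 3600
        maxTokens: 2000000
      - name: pii-rule-cap
        scope: rule
        ruleName: pii-local-only      # hypothetical rule; required when scope is "rule"
        windowSeconds: 3600
        maxTokens: 500000
    classification:
      mode: hybrid                    # header first, detector as the fallback
      headerKey: x-llmkube-classification
      sensitiveClassifications: ["pii", "phi"]
```

Each budget sets at least one of maxTokens or maxUSD, as the field description requires.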
proxy object
Proxy configures the managed router-proxy Deployment (replicas,
image override for air-gapped sites, resources). Sensible defaults
apply when omitted.
image
string
Image overrides the default router-proxy container image. Useful
for air-gapped clusters that pin to an internal registry digest.
imagePullSecrets []object
ImagePullSecrets are passed through to the router-proxy pod spec.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
quarantineDuration
string
QuarantineDuration controls how long the proxy keeps a backend in
the "skip" state after a 5xx or network error before allowing a
half-open probe. Default 15s when unset. Shorter windows make the
proxy recover faster from transient blips; longer windows reduce
probe load on genuinely-down upstreams. Tests can shrink this to
sub-second to verify recovery without sleeping the full default.
replicas
integer
Replicas is the desired number of router-proxy pods. Defaults to 1.
The proxy is stateless for routing decisions; budget and SLO
counters live in memory and reset on pod restart until the
persistence feature lands.
format:
int32
minimum:
1
maximum:
10
resources object
Resources sets the pod's compute resource requests and limits.
claims []object
Claims lists the names of resources, defined in spec.resourceClaims,
that are used by this container.
This field depends on the
DynamicResourceAllocation feature gate.
This field is immutable. It can only be set for containers.
name
string required
Name must match the name of one entry in pod.spec.resourceClaims of
the Pod where this field is used. It makes that resource available
inside a container.
request
string
Request is the name chosen for a request in the referenced claim.
If empty, everything from the claim is made available, otherwise
only the result of this request.
limits
object
Limits describes the maximum amount of compute resources allowed.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
requests
object
Requests describes the minimum amount of compute resources required.
If Requests is omitted for a container, it defaults to Limits if that is explicitly specified,
otherwise to an implementation-defined value. Requests cannot exceed Limits.
More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
responseHeaderTimeout
string
ResponseHeaderTimeout caps how long the proxy waits for the
upstream to begin sending response headers. For non-streaming
chat completions this is effectively a max-generation-time
cap; for streaming dispatches the first SSE chunk arrives well
inside the window so the cap is invisible. Default 120s when
unset. Per-rule and per-backend timeouts (see RouterRule.Timeout
and RouterBackend.Timeout) tighten this on a per-request basis
but cannot extend it beyond this cap.
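A proxy block that exercises the fields above could look like the sketch below. The image reference, pull-secret name, and resource quantities are hypothetical; the two durations simply restate the documented defaults.

```yaml
# Illustrative fragment only; image, secret name, and resource sizes are assumptions.
spec:
  proxy:
    replicas: 2                        # allowed range is 1 to 10
    image: "registry.example.internal/llmkube/router-proxy:v0.0.0"   # hypothetical image
    imagePullSecrets:
      - name: internal-registry-pull   # hypothetical Secret
    quarantineDuration: 15s            # documented default
    responseHeaderTimeout: 120s        # documented default
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```

Because budget and SLO counters are held in memory per pod, running more than one replica presumably means each pod tracks its own counters; treat that as an inference from the Replicas description rather than documented behavior.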
rules []object
Rules are evaluated in declaration order. The first matching rule wins.
If no rule matches, DefaultRoute is used. If neither a matching rule
nor DefaultRoute is set, the request is rejected with HTTP 503.
failClosed
boolean
FailClosed: when true, if no backend in Route.Backends is healthy
or otherwise eligible, the router rejects the request with HTTP 503
rather than falling through to DefaultRoute or subsequent rules.
This is the regulated-data gate: a fail-closed rule guarantees that
matched requests are never served by any other route.
match object
Match groups all match conditions. All declared conditions must be
true for the rule to fire (AND semantics). If Match is omitted the
rule always matches (useful as a catch-all before DefaultRoute).
dataClassification
[]string
DataClassification matches if the inbound request carries any of
these classifications. The classification source depends on
Policy.Classification.Mode: a request header
(x-llmkube-classification by default), the bundled detector, or
both. Common values: "public", "internal", "confidential", "pii",
"phi".
headers
object
Headers performs exact-match equality on inbound HTTP headers
(case-insensitive header name comparison).
latencySLOMs
integer
LatencySLOMs is a P95 first-token-latency target in milliseconds.
When set, if the rolling P95 for the primary backend exceeds this
value the rule promotes its declared fallback. Honored only by the
"primary-fallback" strategy.
format:
int32
minimum:
1
models
[]string
Models matches against the OpenAI-style "model" field in the
request body. Glob patterns are supported (e.g. "qwen3-*").
requiredCapabilities
[]string
RequiredCapabilities filters backends. The rule only matches if at
least one backend in Route.Backends advertises every listed
capability.
taskComplexity
string
TaskComplexity matches the inbound complexity hint (header
x-llmkube-task-complexity).
enum:
simple, moderate, complex
name
string required
Name is used in audit logs and metrics labels.
pattern:
^[a-z0-9][a-z0-9-]{0,62}$
route object required
Route is the action taken when this rule matches.
backends
[]string required
Backends is an ordered list of RouterBackend.Name values. For the
"primary-fallback" strategy, the first entry is the primary and
subsequent entries are tried in order on failure. For "weighted",
traffic is distributed across all entries by Backend.Weight. For
"shadow", the first entry serves the response and subsequent entries
receive mirrored requests for evaluation only.
minItems:
1
strategy
string
Strategy selects how multiple backends are used.
enum:
primary-fallback, weighted, shadow
timeout
string
Timeout caps how long the proxy waits for the upstream to begin
sending response headers on dispatches matched by this rule.
When set it overrides RouterBackend.Timeout and the proxy
default. Resolution order at dispatch time:
rule.timeout || backend.timeout || proxy default.
Useful for tightening regulated-data rules (sub-10s strict
fail-fast) or extending long-reasoning rules (120s+).
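The rule semantics above (declaration order, first match wins, AND across match conditions, fail-closed for sensitive data) can be illustrated with a sketch like the following. Rule and backend names are hypothetical, and the example assumes that local-qwen is a local-tier backend and cloud-claude a cloud-tier backend declared in spec.backends.

```yaml
# Illustrative fragment only; rule names, backend names, and timeouts are assumptions.
spec:
  rules:
    # Evaluated first: sensitive data stays on local-tier backends or is rejected with 503.
    - name: pii-local-only
      failClosed: true
      match:
        dataClassification: ["pii", "phi"]
      route:
        backends: ["local-qwen"]
        strategy: primary-fallback
      timeout: 10s                     # rule.timeout takes precedence over backend.timeout
    # Both conditions must hold (AND semantics) for this rule to fire.
    - name: complex-tools
      match:
        taskComplexity: complex
        requiredCapabilities: ["tools"]
      route:
        backends: ["cloud-claude", "local-qwen"]   # first entry is the primary
        strategy: primary-fallback
    # Glob match on the request body's "model" field.
    - name: qwen-models
      match:
        models: ["qwen3-*"]
      route:
        backends: ["local-qwen"]
        strategy: primary-fallback
    # No match block, so this rule always matches and acts as a catch-all.
    - name: catch-all-weighted
      route:
        backends: ["local-qwen", "cloud-claude"]
        strategy: weighted             # split by each backend's weight
```

Placing the fail-closed rule first matters because evaluation stops at the first matching rule; a broader rule declared earlier would capture the request before the sensitive-data rule is reached.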
status object
status defines the observed state of ModelRouter
activeRules
integer
ActiveRules is the count of rules that successfully validated
against current backend state.
format:
int32
backends []object
Backends reports the resolved address and current health of every
declared backend.
address
string
Address is the resolved upstream URL the router-proxy dispatches to.
For local backends this is the InferenceService's cluster URL; for
external backends it is the provider's base URL.
healthy
boolean
Healthy reflects the most recent probe result.
lastProbeTime
string
LastProbeTime is when the proxy last completed a health probe for
this backend.
format:
date-time
message
string
Message provides extra context, especially when Healthy is false
(e.g. "InferenceService not Ready", "Secret missing key
ANTHROPIC_API_KEY").
name
string required
Name matches RouterBackend.Name.
tier
string
Tier mirrors RouterBackend.Tier for convenience.
budgetUtilization []object
BudgetUtilization summarises current budget consumption.
name
string required
Name matches BudgetSpec.Name.
tokensUsed
integer
TokensUsed is the rolling-window token count.
format:
int64
usdUsed
string
USDUsed is the rolling-window estimated cost in USD.
utilization
string
Utilization is the fraction of the budget consumed, 0.0 to 1.0.
When both MaxTokens and MaxUSD are set this is the maximum of the
two utilizations.
conditions []object
conditions represent the current state of the ModelRouter resource.
Standard condition types:
- "Validated": the spec passed static validation
- "BackendsReady": all referenced backends are reachable and healthy
- "Available": the router-proxy is serving traffic
- "Degraded": at least one backend is unhealthy but the router
can still serve other routes
lastTransitionTime
string required
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format:
date-time
message
string required
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength:
32768
observedGeneration
integer
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format:
int64
minimum:
0
reason
string required
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
pattern:
^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength:
1
maxLength:
1024
status
string required
status of the condition, one of True, False, Unknown.
enum:
True, False, Unknown
type
string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern:
^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength:
316
endpoint
string
Endpoint is the in-cluster URL clients should hit. Populated once
the router-proxy Service is ready.
lastUpdated
string
LastUpdated is the timestamp of the last status reconciliation.
format:
date-time
phase
string
Phase is a coarse summary of the router's state.
Possible values: Pending, Provisioning, Ready, Degraded, Failed.
enum:
Pending, Provisioning, Ready, Degraded, Failed
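To show how the status fields above fit together, here is a hand-written sketch of what a reconciled status might contain when one backend is unhealthy. It is not captured controller output: every value, including the addresses, timestamps, and the condition reason, is a hypothetical placeholder.

```yaml
# Illustrative only; not real controller output. All values are assumptions.
status:
  phase: Degraded
  activeRules: 3
  endpoint: http://example.default.svc.cluster.local:8080   # hypothetical Service URL
  lastUpdated: "2025-01-01T00:00:00Z"
  backends:
    - name: local-qwen
      tier: local
      healthy: true
      address: http://qwen3-8b.default.svc.cluster.local:8080   # hypothetical cluster URL
      lastProbeTime: "2025-01-01T00:00:00Z"
    - name: cloud-claude
      tier: cloud
      healthy: false
      message: "Secret missing key ANTHROPIC_API_KEY"
  budgetUtilization:
    - name: router-daily
      usdUsed: "12.50"
      tokensUsed: 1250000
      utilization: "0.25"        # 12.50 / 50.00, assuming a maxUSD cap of 50.00
  conditions:
    - type: Degraded
      status: "True"
      reason: BackendUnhealthy   # hypothetical CamelCase reason
      message: "1 of 2 backends is failing health probes"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```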