
ModelRouter

inference.llmkube.dev / v1alpha1

apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: example
apiVersion string
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind string
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata object
spec object required
spec defines the desired state of ModelRouter
backends []object required
Backends are the candidate destinations the router can dispatch to. Order is not significant; selection is rule-driven. At least one backend must be declared.
minItems: 1
capabilities []string
Capabilities advertised by this backend. Rules can require capabilities (e.g. ["tools", "vision", "long-context"]) to filter candidates.
costPerMillionTokens object
CostPerMillionTokens is informational. Used for cost-aware routing metrics and audit-log enrichment. Values are USD.
completionUSD string
CompletionUSD is the cost per million completion (output) tokens, in USD.
pattern: ^[0-9]+(\.[0-9]+)?$
promptUSD string
PromptUSD is the cost per million prompt (input) tokens, in USD.
pattern: ^[0-9]+(\.[0-9]+)?$
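As an illustration, a backend entry might declare its pricing as in the sketch below. The backend name and both prices are placeholders, not real provider rates. Both fields are strings, so the values are quoted and must match the decimal pattern above.

```yaml
spec:
  backends:
    - name: cloud-large              # hypothetical backend name
      # external or inferenceServiceRef omitted for brevity
      costPerMillionTokens:
        promptUSD: "3.00"            # placeholder price per million input tokens
        completionUSD: "15.00"       # placeholder price per million output tokens
```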
external object
External describes an out-of-cluster provider (for example Anthropic, OpenAI, or a LiteLLM proxy; see the provider enum for the full list). Mutually exclusive with InferenceServiceRef.
credentialsSecretRef object
CredentialsSecretRef points to a Kubernetes Secret containing the provider credentials. Conventional keys: ANTHROPIC_API_KEY, OPENAI_API_KEY, LITELLM_MASTER_KEY. The router-proxy reads these as environment variables.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
model string required
Model is the upstream model identifier passed through to the provider (e.g. "claude-opus-4-7", "gpt-5", a LiteLLM model alias).
provider string required
Provider identifies the upstream API surface. For "litellm", URL must point at a running LiteLLM proxy that speaks the OpenAI-compatible API. For first-party providers, URL is optional (provider defaults apply).
enum: anthropic, openai, bedrock, vertex_ai, litellm
url string
URL is the base URL for the provider. Required for "litellm"; optional for first-party providers, which use their published default.
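A minimal sketch of two external backends, assuming the fields described above. The backend names, Secret names, LiteLLM alias, and the example.com URL are hypothetical; only the provider values and the Secret key names come from this reference.

```yaml
spec:
  backends:
    - name: claude-cloud                   # hypothetical backend name
      tier: cloud
      external:
        provider: anthropic
        model: claude-opus-4-7             # example identifier from the model field description
        credentialsSecretRef:
          name: anthropic-credentials      # hypothetical Secret holding ANTHROPIC_API_KEY
        # url omitted: first-party providers use their published default
    - name: litellm-gateway                # hypothetical backend name
      tier: cloud
      external:
        provider: litellm
        model: team-default                # hypothetical LiteLLM model alias
        url: https://litellm.example.com   # hypothetical; url is required for litellm
        credentialsSecretRef:
          name: litellm-credentials        # hypothetical Secret holding LITELLM_MASTER_KEY
```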
healthCheck object
HealthCheck overrides the default health probe applied to this backend by the router-proxy.
intervalSeconds integer
IntervalSeconds is how often the router-proxy probes the backend.
format: int32
minimum: 1
path string
Path is the HTTP path probed for health. Defaults to "/health" for local backends and to the provider's documented health route for external providers.
timeoutSeconds integer
TimeoutSeconds is the maximum time a single probe may take.
format: int32
minimum: 1
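For example, a health probe override on a local backend could look like the following sketch. The backend and InferenceService names are hypothetical and the interval and timeout values are illustrative, not recommended settings.

```yaml
spec:
  backends:
    - name: local-qwen               # hypothetical backend name
      inferenceServiceRef:
        name: qwen3-8b               # hypothetical InferenceService
      healthCheck:
        path: /health                # same as the documented default for local backends
        intervalSeconds: 10          # probe every 10 seconds (illustrative)
        timeoutSeconds: 2            # fail a probe that takes longer than 2 seconds (illustrative)
```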
inferenceServiceRef object
InferenceServiceRef references an in-cluster InferenceService. Mutually exclusive with External.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
name string required
Name is the stable identifier used by rules and observability labels. Must be lowercase alphanumeric or '-'.
pattern: ^[a-z0-9][a-z0-9-]{0,62}$
tier string
Tier classifies the backend for rule matching. "local" backends are served from inside the cluster; "cloud" backends egress the cluster boundary. Fail-closed rules can only route to local-tier backends.
enum: local, cloud
timeout string
Timeout caps how long the proxy waits for this backend to begin sending response headers. When set it overrides the proxy default for dispatches that target this backend. Resolution order at dispatch time: rule.timeout || backend.timeout || proxy default (ModelRouter.spec.proxy.responseHeaderTimeout). Useful when backends in the same router have wildly different P99 envelopes (in-cluster vLLM vs Anthropic global LB).
weight integer
Weight is used for the "weighted" routing strategy. Higher values receive proportionally more traffic. Ignored for other strategies. Default 1 when unset.
format: int32
minimum: 0
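Putting the backend fields together, a two-backend declaration might look like the sketch below. All names are hypothetical, and the duration-string format for timeout ("30s") is an assumption based on the defaults this reference writes as "120s" and "15s".

```yaml
spec:
  backends:
    - name: local-qwen               # hypothetical; must match ^[a-z0-9][a-z0-9-]{0,62}$
      tier: local
      capabilities: ["tools", "long-context"]
      inferenceServiceRef:
        name: qwen3-8b               # hypothetical InferenceService
      timeout: 30s                   # assumed duration format; overrides the proxy default for this backend
      weight: 3                      # only consulted by the "weighted" strategy
    - name: claude-cloud             # hypothetical
      tier: cloud
      capabilities: ["tools", "vision"]
      external:
        provider: anthropic
        model: claude-opus-4-7
      timeout: 90s
      weight: 1
```

If a rule using the "weighted" strategy lists both backends, weights of 3 and 1 would presumably send about three quarters of matched traffic to local-qwen and the remainder to claude-cloud.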
defaultRoute string
DefaultRoute names a backend used when no rule matches. Must reference the Name of an entry in Backends.
endpoint object
Endpoint defines the Kubernetes Service the router-proxy is exposed through. Mirrors the shape used by InferenceService.
path string
Path is the HTTP path for the inference endpoint.
port integer
Port is the Service port.
format: int32
minimum: 1
maximum: 65535
type string
Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer)
enum: ClusterIP, NodePort, LoadBalancer
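A sketch of an endpoint block using the three fields above. The port number and path are illustrative assumptions; this reference does not state their defaults.

```yaml
spec:
  endpoint:
    type: ClusterIP                  # one of ClusterIP, NodePort, LoadBalancer
    port: 8080                       # illustrative; any value from 1 to 65535 is valid
    path: /v1/chat/completions       # assumed OpenAI-style path, not a documented default
```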
mcpServer object
MCPServer optionally exposes this router as a Model Context Protocol endpoint. Inactive until the Phase 3 MCP feature lands; the field is reserved in the schema for forward compatibility.
enabled boolean
Enabled toggles MCP exposure. Default false. When true (after Phase 3 lands), the router-proxy serves an MCP endpoint at /mcp using Streamable HTTP transport and OAuth 2.1.
policy object
Policy holds cross-cutting controls (budgets, classification, audit).
auditLog object
AuditLog controls structured audit emission. Auditing is always on; this field tunes the destination and verbosity.
filePath string
FilePath is the destination when Sink=file. Must be writable inside the router-proxy container. Defaults to "/var/log/mlx-router/audit.log".
includeRequestBody boolean
IncludeRequestBody, when true, includes the OpenAI request body in every audit entry. Disabled by default for size and privacy.
sink string
Sink selects the audit-log destination. "stdout" (default) emits one JSON object per line to the proxy container stdout, where it can be collected by the cluster log stack. "file" writes to FilePath inside the proxy container. "otlp" forwards entries to an OTel collector as log records.
enum: stdout, file, otlp
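For instance, routing audit entries to a file instead of stdout might be configured as in this sketch, which uses only the fields described above.

```yaml
spec:
  policy:
    auditLog:
      sink: file                                # stdout (default), file, or otlp
      filePath: /var/log/mlx-router/audit.log   # the documented default, written out explicitly
      includeRequestBody: false                 # the documented default; bodies are omitted
```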
budgets []object
Budgets cap token and dollar consumption per scope over a rolling window. An empty list means no budget enforcement.
headerKey string
HeaderKey is the request header carrying the team identifier when Scope=team. Defaults to "x-llmkube-team".
maxTokens integer
MaxTokens caps total tokens (prompt + completion) over the window. Either MaxTokens or MaxUSD (or both) must be set.
format: int64
minimum: 1
maxUSD string
MaxUSD caps total estimated cost in USD over the window. Cost is computed from RouterBackend.CostPerMillionTokens.
pattern: ^[0-9]+(\.[0-9]+)?$
name string required
Name identifies this budget for metrics, status, and audit logs.
pattern: ^[a-z0-9][a-z0-9-]{0,62}$
ruleName string
RuleName is required when Scope=rule. References a RouterRule.Name.
scope string required
Scope determines what the budget applies to. "router" caps all traffic through this ModelRouter. "rule" caps traffic matching a single named rule (see RuleName). "team" caps traffic identified by a request header (see HeaderKey).
enum: router, rule, team
windowSeconds integer
WindowSeconds is the rolling window over which the cap is evaluated.
format: int32
minimum: 1
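The three scopes can be sketched as follows. Budget names, the rule name, and every limit are hypothetical values chosen for illustration; each entry sets at least one of maxTokens or maxUSD, as required.

```yaml
spec:
  policy:
    budgets:
      - name: router-daily           # hypothetical budget name
        scope: router                # caps all traffic through this ModelRouter
        windowSeconds: 86400         # 24 hours
        maxUSD: "250.00"             # string field, must match the decimal pattern
      - name: pii-rule-hourly        # hypothetical
        scope: rule
        ruleName: pii-local-only     # hypothetical rule name; required when scope is rule
        windowSeconds: 3600          # 1 hour
        maxTokens: 2000000
      - name: team-weekly            # hypothetical
        scope: team
        headerKey: x-llmkube-team    # the documented default header
        windowSeconds: 604800        # 7 days
        maxTokens: 50000000
        maxUSD: "1000.00"
```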
classification object
Classification configures how the router determines the data classification of an inbound request.
headerKey string
HeaderKey is the request header carrying the classification. Defaults to "x-llmkube-classification".
mode string
Mode selects how the router derives a request's classification. "header-only" (default) trusts the request header (HeaderKey, defaults to x-llmkube-classification). "detector" runs the bundled in-proxy detector. "hybrid" prefers the header, falling back to the detector when no header is present.
enum: header-only, detector, hybrid
sensitiveClassifications []string
SensitiveClassifications are the classification values that trigger fail-closed validation: any rule matching one of these values must have FailClosed=true and reference only local-tier backends. Defaults to ["pii", "phi"].
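A minimal sketch of a classification policy. Adding "confidential" to the sensitive list is an illustrative choice, not a recommendation from this reference.

```yaml
spec:
  policy:
    classification:
      mode: hybrid                             # header first, bundled detector as fallback
      headerKey: x-llmkube-classification      # the documented default
      # Any rule matching one of these values must set failClosed: true
      # and list only local-tier backends.
      sensitiveClassifications: ["pii", "phi", "confidential"]
```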
proxy object
Proxy configures the managed router-proxy Deployment (replicas, image override for air-gapped sites, resources). Sensible defaults apply when omitted.
image string
Image overrides the default router-proxy container image. Useful for air-gapped clusters that pin to an internal registry digest.
imagePullSecrets []object
ImagePullSecrets are passed through to the router-proxy pod spec.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
quarantineDuration string
QuarantineDuration controls how long the proxy keeps a backend in the "skip" state after a 5xx or network error before allowing a half-open probe. Default 15s when unset. Shorter windows make the proxy recover faster from transient blips; longer windows reduce probe load on genuinely-down upstreams. Tests can shrink this to sub-second to verify recovery without sleeping the full default.
replicas integer
Replicas is the desired number of router-proxy pods. Defaults to 1. The proxy is stateless for routing decisions; budget and SLO counters live in memory and reset on pod restart until the persistence feature lands.
format: int32
minimum: 1
maximum: 10
resources object
Resources sets the pod's compute resource requests and limits.
claims []object
Claims lists the names of resources, defined in spec.resourceClaims, that are used by this container. This field depends on the DynamicResourceAllocation feature gate. This field is immutable. It can only be set for containers.
name string required
Name must match the name of one entry in pod.spec.resourceClaims of the Pod where this field is used. It makes that resource available inside a container.
request string
Request is the name chosen for a request in the referenced claim. If empty, everything from the claim is made available, otherwise only the result of this request.
limits object
Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
requests object
Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
responseHeaderTimeout string
ResponseHeaderTimeout caps how long the proxy waits for the upstream to begin sending response headers. For non-streaming chat completions this is effectively a max-generation-time cap; for streaming dispatches the first SSE chunk arrives well inside the window so the cap is invisible. Default 120s when unset. Per-rule and per-backend timeouts (see RouterRule.Timeout and RouterBackend.Timeout) tighten this on a per-request basis but cannot extend it beyond this cap.
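The proxy fields above might be combined as in this sketch. The image reference, pull Secret name, and resource figures are hypothetical; the two durations simply restate the documented defaults.

```yaml
spec:
  proxy:
    replicas: 2                                               # allowed range is 1 to 10
    image: registry.example.internal/llmkube/router-proxy:pinned   # hypothetical internal image reference
    imagePullSecrets:
      - name: internal-registry                               # hypothetical Secret name
    quarantineDuration: 15s                                   # documented default
    responseHeaderTimeout: 120s                               # documented default
    resources:
      requests:
        cpu: 250m                                             # illustrative sizing
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```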
rules []object
Rules are evaluated in declaration order. The first matching rule wins. If no rule matches, DefaultRoute is used. If neither a matching rule nor DefaultRoute is set, the request is rejected with HTTP 503.
failClosed boolean
When FailClosed is true and no backend in Route.Backends is healthy or otherwise eligible, the router rejects the request with HTTP 503 rather than falling through to DefaultRoute or subsequent rules. This is the regulated-data gate: a fail-closed rule guarantees that matched requests are never served by any other route.
match object
Match groups all match conditions. All declared conditions must be true for the rule to fire (AND semantics). If Match is omitted the rule always matches (useful as a catch-all before DefaultRoute).
dataClassification []string
DataClassification matches if the inbound request carries any of these classifications. The classification source depends on Policy.Classification.Mode: a request header (x-llmkube-classification by default), the bundled detector, or both. Common values: "public", "internal", "confidential", "pii", "phi".
headers object
Headers performs exact-match equality on inbound HTTP headers (case-insensitive header name comparison).
latencySLOMs integer
LatencySLOMs is a P95 first-token-latency target in milliseconds. When set, if the rolling P95 for the primary backend exceeds this value the rule promotes its declared fallback. Honored only by the "primary-fallback" strategy.
format: int32
minimum: 1
models []string
Models matches against the OpenAI-style "model" field in the request body. Glob patterns are supported (e.g. "qwen3-*").
requiredCapabilities []string
RequiredCapabilities filters backends. The rule only matches if at least one backend in Route.Backends advertises every listed capability.
taskComplexity string
TaskComplexity matches the inbound complexity hint (header x-llmkube-task-complexity).
enum: simple, moderate, complex
name string required
Name is used in audit logs and metrics labels.
pattern: ^[a-z0-9][a-z0-9-]{0,62}$
route object required
Route is the action taken when this rule matches.
backends []string required
Backends is an ordered list of RouterBackend.Name values. For the "primary-fallback" strategy, the first entry is the primary and subsequent entries are tried in order on failure. For "weighted", traffic is distributed across all entries by Backend.Weight. For "shadow", the first entry serves the response and subsequent entries receive mirrored requests for evaluation only.
minItems: 1
strategy string
Strategy selects how multiple backends are used.
enum: primary-fallback, weighted, shadow
timeout string
Timeout caps how long the proxy waits for the upstream to begin sending response headers on dispatches matched by this rule. When set it overrides RouterBackend.Timeout and the proxy default. Resolution order at dispatch time: rule.timeout || backend.timeout || proxy default. Useful for tightening regulated-data rules (sub-10s strict fail-fast) or extending long-reasoning rules (120s+).
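The rule semantics above can be sketched with three rules evaluated top to bottom. Every rule and backend name is hypothetical, the backends are assumed to be declared under spec.backends, and the "8s" duration format is an assumption. Placing latencySLOMs under match follows the field order of this reference.

```yaml
spec:
  defaultRoute: local-qwen             # used only when no rule matches
  rules:
    - name: pii-local-only
      failClosed: true                 # 503 instead of falling through when no listed backend is eligible
      timeout: 8s                      # takes precedence over backend.timeout and the proxy default
      match:
        dataClassification: ["pii", "phi"]
      route:
        strategy: primary-fallback
        backends: ["local-qwen", "local-llama"]   # assumed to be local-tier backends
    - name: complex-tools
      match:                           # all conditions must hold (AND semantics)
        taskComplexity: complex
        requiredCapabilities: ["tools"]
        latencySLOMs: 1500             # promote the fallback if the primary's rolling P95 exceeds 1.5 s
      route:
        strategy: primary-fallback
        backends: ["claude-cloud", "local-qwen"]  # first entry is the primary
    - name: qwen-weighted
      match:
        models: ["qwen3-*"]            # glob against the request's "model" field
      route:
        strategy: weighted             # split by each backend's weight
        backends: ["local-qwen", "local-llama"]
```

Because the first matching rule wins, the fail-closed rule is listed first so that a request classified as "pii" cannot be claimed by a later, more permissive rule.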
status object
status defines the observed state of ModelRouter
activeRules integer
ActiveRules is the count of rules that successfully validated against current backend state.
format: int32
backends []object
Backends reports the resolved address and current health of every declared backend.
address string
Address is the resolved upstream URL the router-proxy dispatches to. For local backends this is the InferenceService's cluster URL; for external backends it is the provider's base URL.
healthy boolean
Healthy reflects the most recent probe result.
lastProbeTime string
LastProbeTime is when the proxy last completed a health probe for this backend.
format: date-time
message string
Message provides extra context, especially when Healthy is false (e.g. "InferenceService not Ready", "Secret missing key ANTHROPIC_API_KEY").
name string required
Name matches RouterBackend.Name.
tier string
Tier mirrors RouterBackend.Tier for convenience.
budgetUtilization []object
BudgetUtilization summarises current budget consumption.
name string required
Name matches BudgetSpec.Name.
tokensUsed integer
TokensUsed is the rolling-window token count.
format: int64
usdUsed string
USDUsed is the rolling-window estimated cost in USD.
utilization string
Utilization is the fraction of the budget consumed, 0.0 to 1.0. When both MaxTokens and MaxUSD are set this is the maximum of the two utilizations.
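As a worked illustration of the maximum rule, consider a hypothetical budget with MaxTokens 50000000 and MaxUSD "1000.00". The status entry below uses made-up consumption figures, and the exact string formatting of usdUsed and utilization is an assumption.

```yaml
status:
  budgetUtilization:
    - name: team-weekly              # hypothetical budget name
      tokensUsed: 12500000           # 12,500,000 / 50,000,000 = 0.25 of MaxTokens
      usdUsed: "400.00"              # 400 / 1000 = 0.40 of MaxUSD
      utilization: "0.40"            # max(0.25, 0.40) = 0.40
```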
conditions []object
conditions represent the current state of the ModelRouter resource. Standard condition types:
- "Validated": the spec passed static validation
- "BackendsReady": all referenced backends are reachable and healthy
- "Available": the router-proxy is serving traffic
- "Degraded": at least one backend is unhealthy but the router can still serve other routes
lastTransitionTime string required
lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format: date-time
message string required
message is a human readable message indicating details about the transition. This may be an empty string.
maxLength: 32768
observedGeneration integer
observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.
format: int64
minimum: 0
reason string required
reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.
pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength: 1
maxLength: 1024
status string required
status of the condition, one of True, False, Unknown.
enum: True, False, Unknown
type string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength: 316
endpoint string
Endpoint is the in-cluster URL clients should hit. Populated once the router-proxy Service is ready.
lastUpdated string
LastUpdated is the timestamp of the last status reconciliation.
format: date-time
phase string
Phase is a coarse summary of the router's state. Possible values: Pending, Provisioning, Ready, Degraded, Failed.
enum: Pending, Provisioning, Ready, Degraded, Failed
