ModelRouter

inference.llmkube.dev / v1alpha1

apiVersion: inference.llmkube.dev/v1alpha1
kind: ModelRouter
metadata:
  name: example

apiVersion string

APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

kind string

Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

metadata object

spec object required

spec defines the desired state of ModelRouter.

backends []object required

Backends are the candidate destinations the router can dispatch to. Order is not significant; selection is rule-driven. At least one backend must be declared.

minItems: 1

capabilities []string

Capabilities advertised by this backend. Rules can require capabilities (e.g. ["tools", "vision", "long-context"]) to filter candidates.

costPerMillionTokens object

CostPerMillionTokens is informational. Used for cost-aware routing metrics and audit-log enrichment. Values are USD.

completionUSD string

CompletionUSD is the cost per million completion (output) tokens, in USD.

pattern: ^[0-9]+(\.[0-9]+)?$

promptUSD string

PromptUSD is the cost per million prompt (input) tokens, in USD.

pattern: ^[0-9]+(\.[0-9]+)?$

displayName string

DisplayName is an optional freeform label published as the model id on /v1/models and used by BackendNameMatch to resolve a request's model field to a backend. When unset, Name is used for both purposes (current behavior). This lets the k8s-safe Name differ from the user-facing model identifier (e.g. Name "claude-opus-4" with DisplayName "claude-opus-4-20250514").

external object

External describes an out-of-cluster provider (Anthropic, OpenAI, or a LiteLLM proxy). Mutually exclusive with InferenceServiceRef.

credentialsSecretRef object

CredentialsSecretRef points to a Kubernetes Secret containing the provider credentials. Conventional keys: ANTHROPIC_API_KEY, OPENAI_API_KEY, LITELLM_MASTER_KEY. The router-proxy reads these as environment variables.

name string

Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

model string required

Model is the upstream model identifier passed through to the provider (e.g. "claude-opus-4-7", "gpt-5", a LiteLLM model alias).

provider string required

Provider identifies the upstream API surface. For "litellm", URL must point at a running LiteLLM proxy that speaks an OpenAI-compatible API. For first-party providers, URL is optional (provider defaults apply).

enum: anthropic, openai, bedrock, vertex_ai, litellm

url string

URL is the base URL for the provider. Required for "litellm"; optional for first-party providers, which use their published default.

healthCheck object

HealthCheck overrides the default health probe applied to this backend by the router-proxy.

intervalSeconds integer

IntervalSeconds is how often the router-proxy probes the backend.

format: int32

minimum: 1

path string

Path is the HTTP path probed for health. Defaults to "/health" for local backends and to the provider's documented health route for external providers.

timeoutSeconds integer

TimeoutSeconds is the maximum time a single probe may take.

format: int32

minimum: 1

inferenceServiceRef object

InferenceServiceRef references an in-cluster InferenceService. Mutually exclusive with External.

name string

name string required

Name is the stable identifier used by rules and observability labels. Must be lowercase alphanumeric or '-'.

pattern: ^[a-z0-9][a-z0-9-]{0,62}$

tier string

Tier classifies the backend for rule matching. "local" backends are served from inside the cluster; "cloud" backends egress the cluster boundary. Fail-closed rules can only route to local-tier backends.

enum: local, cloud

timeout string

Timeout caps how long the proxy waits for this backend to begin sending response headers. When set it overrides the proxy default for dispatches that target this backend. Resolution order at dispatch time: rule.timeout || backend.timeout || proxy default (ModelRouter.spec.proxy.responseHeaderTimeout). Useful when backends in the same router have wildly different P99 envelopes (in-cluster vLLM vs Anthropic global LB).

weight integer

Weight is used for the "weighted" routing strategy. Higher values receive proportionally more traffic. Ignored for other strategies. Default 1 when unset.

format: int32

minimum: 0
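As an illustration, the fragment below sketches two entries under spec.backends using the fields described above: one in-cluster backend and one external provider. The backend names, the InferenceService name, the Secret name, the cost figures, and the "30s" duration format are assumptions chosen for the example, not values taken from the schema.

```yaml
spec:
  backends:
    # In-cluster backend. "qwen-local" and "qwen3-32b" are hypothetical names.
    - name: qwen-local
      tier: local
      inferenceServiceRef:
        name: qwen3-32b
      capabilities: ["tools"]
      weight: 3                  # only consulted by the "weighted" strategy
      healthCheck:
        path: /health
        intervalSeconds: 10
        timeoutSeconds: 2
    # Out-of-cluster backend. external is mutually exclusive with inferenceServiceRef.
    - name: claude-opus-4
      displayName: claude-opus-4-20250514
      tier: cloud
      external:
        provider: anthropic
        model: claude-opus-4-7
        credentialsSecretRef:
          name: anthropic-credentials   # hypothetical Secret holding ANTHROPIC_API_KEY
      capabilities: ["tools", "vision", "long-context"]
      costPerMillionTokens:
        promptUSD: "3.00"        # placeholder figure, not a real price
        completionUSD: "12.00"   # placeholder figure, not a real price
      timeout: 30s               # assumes a Go-style duration string
      weight: 1
```

The cost values are quoted so that they are submitted as strings, which is what the promptUSD and completionUSD patterns validate.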

dataPlane string

DataPlane selects how this ModelRouter serves traffic. "Proxy" (default) provisions the managed router-proxy Deployment + Service and routes in-process per the rules below (the current behavior, fully backward compatible). "Gateway" compiles the backends and rules onto a pre-installed Envoy AI Gateway: a Backend + AIServiceBackend per InferenceServiceRef backend, a multi-rule AIGatewayRoute, and a retry/failover BackendTrafficPolicy. In Gateway mode the router-proxy is NOT provisioned. Gateway mode requires the aigw CRDs to be installed; when they are absent the gateway resources are not generated and a condition explains why.

enum: Proxy, Gateway

defaultRoute string

DefaultRoute names a backend used when no rule matches. Must reference the Name of an entry in Backends.

defaultRouteStrategy string

DefaultRouteStrategy decides what happens when no rule matches. "Static" (default) routes to DefaultRoute. "BackendNameMatch" first tries to match the request's model to a backend Name, falling back to DefaultRoute only if none matches. BackendNameMatch makes every backend, including cloud-tier ones, directly addressable by name; sensitive-data protection still relies on a matching failClosed rule, which is evaluated first and therefore continues to gate before any name match.

enum: Static, BackendNameMatch
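A minimal sketch of the fallback settings, assuming a backend named "qwen-local" (a hypothetical name) is declared under spec.backends:

```yaml
spec:
  # Used when no rule matches and, under BackendNameMatch, when the
  # request's model does not resolve to any backend either.
  defaultRoute: qwen-local
  # Static (default) would always go straight to defaultRoute.
  defaultRouteStrategy: BackendNameMatch
```

With BackendNameMatch, cloud-tier backends become addressable by name, so a failClosed rule covering sensitive classifications is still needed to keep that traffic on local-tier backends.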

endpoint object

Endpoint defines the Kubernetes Service the router-proxy is exposed through. Mirrors the shape used by InferenceService.

gateway object

Gateway opts this InferenceService into Envoy AI Gateway exposure. When set and Enabled, the operator generates the Backend / AIServiceBackend / AIGatewayRoute resources that front this service through a pre-installed Envoy AI Gateway. nil (the default) preserves today's behavior (no gateway resources). The Envoy AI Gateway stack and the referenced Gateway are a documented prerequisite; LLMKube does not install or own them.

enabled boolean

Enabled is the opt-in switch. When false (or when Gateway is nil), the operator generates no gateway resources for this InferenceService.

gatewayRef object required

GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io) the generated AIGatewayRoute attaches to. The Gateway typically lives in a dedicated gateway namespace; cross-namespace attachment requires the Gateway listener's allowedRoutes.namespaces to permit this InferenceService's namespace (a documented prerequisite for the MVP; the operator does not generate ReferenceGrants or touch the listener).

name string required

Name is the Gateway's name.

namespace string

Namespace is the Gateway's namespace. Empty means the InferenceService's own namespace.

modelName string

ModelName is the OpenAI "model" string clients send, matched by the generated route rule (the x-ai-eg-model header the gateway's ext_proc populates from the request body). Defaults to ModelRef, falling back to the InferenceService name when ModelRef is empty.

nodePort integer

NodePort is the specific NodePort to pin when endpoint.type is NodePort. If set, the Service will use this exact port instead of auto-assigning from the 30000-32767 range. This provides a stable external endpoint across redeployments.

format: int32

minimum: 30000

maximum: 32767

path string

Path is the HTTP path for the inference endpoint.

port integer

Port is the service port.

format: int32

minimum: 1

maximum: 65535

type string

Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer).

enum: ClusterIP, NodePort, LoadBalancer
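For example, an endpoint pinned to a fixed NodePort might look like the sketch below. The port number and path are assumptions for illustration; only the 30000-32767 range for nodePort comes from the schema.

```yaml
spec:
  endpoint:
    type: NodePort
    port: 8080                    # hypothetical service port (1-65535)
    nodePort: 30080               # must fall within 30000-32767
    path: /v1/chat/completions    # hypothetical path
```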

gatewayRef object

GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io) the generated AIGatewayRoute attaches to when DataPlane is "Gateway". Required in Gateway mode; ignored in Proxy mode. The Gateway and the Envoy AI Gateway stack are a documented prerequisite; LLMKube does not install or own them. Cross-namespace attachment requires the Gateway listener's allowedRoutes.namespaces to permit this ModelRouter's namespace.

name string required

Name is the Gateway's name.

namespace string

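Namespace is the Gateway's namespace. Empty means the namespace of the resource declaring this reference (here, the ModelRouter's own namespace; the field description is shared with InferenceService).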
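A sketch of Gateway mode, assuming an Envoy AI Gateway is already installed and a Gateway named "ai-gateway" exists in a "gateway-system" namespace (both names are hypothetical):

```yaml
spec:
  dataPlane: Gateway
  gatewayRef:
    name: ai-gateway            # hypothetical Gateway name
    namespace: gateway-system   # hypothetical; the listener's allowedRoutes.namespaces
                                # must permit the ModelRouter's namespace
```

In Proxy mode the same gatewayRef block would be ignored.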

mcpServer object

MCPServer optionally exposes this router as a Model Context Protocol endpoint. Inactive until the Phase 3 MCP feature lands; the field is reserved in the schema for forward compatibility.

enabled boolean

Enabled toggles MCP exposure. Default false. When true (after Phase 3 lands), the router-proxy serves an MCP endpoint at /mcp using Streamable HTTP transport and OAuth 2.1.

policy object

Policy holds cross-cutting controls (budgets, classification, audit).

auditLog object

AuditLog controls structured audit emission. Auditing is always on; this field tunes the destination and verbosity.

filePath string

FilePath is the destination when Sink=file. Must be writable inside the router-proxy container. Defaults to "/var/log/mlx-router/audit.log".

includeRequestBody boolean

IncludeRequestBody, when true, includes the OpenAI request body in every audit entry. Disabled by default for size and privacy.

sink string

Sink selects the audit-log destination. "stdout" (default) emits one JSON object per line to the proxy container stdout, where it can be collected by the cluster log stack. "file" writes to FilePath inside the proxy container. "otlp" forwards entries to an OTel collector as log records.

enum: stdout, file, otlp
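As a sketch, switching the audit log from the default stdout sink to a file could be expressed as follows; the path shown is the documented default and is spelled out only for clarity:

```yaml
spec:
  policy:
    auditLog:
      sink: file
      filePath: /var/log/mlx-router/audit.log   # must be writable in the proxy container
      includeRequestBody: false                 # default; bodies are omitted for size and privacy
```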

auth object

Auth configures request authentication. In dataPlane: Gateway mode it compiles to an Envoy AI Gateway SecurityPolicy that validates inbound JWTs and maps a verified claim onto a trusted header before any model dispatch. nil means no authentication is enforced. Authentication only; per-team model allowlists (authorization) are a separate surface.

allowlists []object

Allowlists restricts which verified team may reach which model (authorization), as a sibling of JWT (authentication). Each entry grants a team the models it may reach. JWT proves identity; Allowlists decide what that identity is permitted to do. Empty or nil means NO authorization is enforced: any authenticated request reaches any model (the authentication-only behavior of slice 2d-core, so adding this field cannot retroactively lock out an existing router). A non-empty Allowlists flips the generated SecurityPolicy to default-Deny: only the named teams (and, per entry, only their listed models) are allowed, and every other verified team is rejected with HTTP 403. Authorization requires authentication: Allowlists set without JWT is rejected fail-loud (you cannot authorize on an unverified claim), as is an entry with an empty Team or a duplicate Team. In dataPlane: Gateway mode these compile to the authorization block of the SAME SecurityPolicy JWT generates: one Allow rule per entry whose principal matches the verified TeamClaim value (and, when the entry lists models, the resolved x-ai-eg-model header).

models []string

Models is the set of model names this team may reach. Empty means the team may reach all models (identity-only allow). Each value matches the resolved model name (the x-ai-eg-model header), the same value spec.rules[].match.models route on.

team string required

Team is the verified teamClaim value this entry grants access to.

minLength: 1

jwt object

JWT enables JSON Web Token validation. When set (in dataPlane: Gateway mode) the gateway rejects requests without a valid token with HTTP 401 before any model dispatch, and maps the configured claim onto a trusted header.

headerKey string

HeaderKey is the request header the verified TeamClaim value lands in. Downstream team-scoped budgets key on this header. Defaults to "x-llmkube-team", matching the budget default.

issuer string required

Issuer is the OIDC issuer URL that must match the token's "iss" claim.

minLength: 1

jwksURI string required

JWKSURI is the remote JWKS endpoint the gateway fetches signing keys from to verify token signatures.

minLength: 1

provider string required

Provider is a short name for the JWT provider (e.g. "keycloak"). It labels the provider in the generated SecurityPolicy.

minLength: 1

teamClaim string required

TeamClaim is the JWT claim that identifies the tenant (e.g. "team"). Its verified value is copied into HeaderKey.

minLength: 1
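The following sketch combines JWT authentication with a team allowlist. The issuer and JWKS URLs, the team names, and the model names are hypothetical; "keycloak" and "team" are the example values given in the field descriptions.

```yaml
spec:
  dataPlane: Gateway
  policy:
    auth:
      jwt:
        provider: keycloak
        issuer: https://idp.example.com/realms/llm                                  # hypothetical
        jwksURI: https://idp.example.com/realms/llm/protocol/openid-connect/certs   # hypothetical
        teamClaim: team
        headerKey: x-llmkube-team     # documented default, shown explicitly
      allowlists:
        # "research" may reach only the listed model.
        - team: research
          models: ["qwen-local"]
        # "platform" lists no models, so it may reach all of them.
        - team: platform
```

Because allowlists is non-empty here, any verified team other than "research" and "platform" would be rejected with HTTP 403, and omitting the jwt block would cause the configuration to be rejected.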

budgets []object

Budgets caps token and dollar consumption per scope over a rolling window. Empty list means no budget enforcement.

headerKey string

HeaderKey is the request header carrying the team identifier when Scope=team. Defaults to "x-llmkube-team".

maxTokens integer

MaxTokens caps total tokens (prompt + completion) over the window. Either MaxTokens or MaxUSD (or both) must be set.

format: int64

minimum: 1

maxUSD string

MaxUSD caps total estimated cost in USD over the window. Cost is computed from RouterBackend.CostPerMillionTokens.

pattern: ^[0-9]+(\.[0-9]+)?$

name string required

Name identifies this budget for metrics, status, and audit logs.

pattern: ^[a-z0-9][a-z0-9-]{0,62}$

ruleName string

RuleName is required when Scope=rule. References a RouterRule.Name.

scope string required

Scope determines what the budget applies to. "router" caps all traffic through this ModelRouter. "rule" caps traffic matching a single named rule (see RuleName). "team" caps traffic identified by a request header (see HeaderKey).

enum: router, rule, team

windowSeconds integer

WindowSeconds is the rolling window over which the cap is evaluated.

format: int32

minimum: 1
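A sketch of one budget per scope is shown below. The budget names, limits, and the rule name "pii-local-only" are assumptions for illustration; the rule is assumed to be declared under spec.rules.

```yaml
spec:
  policy:
    budgets:
      # Caps all traffic through this ModelRouter over a 24-hour window.
      - name: router-daily
        scope: router
        maxUSD: "250.00"
        windowSeconds: 86400
      # Caps each team, identified by the request header, over one hour.
      - name: team-hourly
        scope: team
        headerKey: x-llmkube-team
        maxTokens: 2000000
        windowSeconds: 3600
      # Caps a single named rule; ruleName is required for scope "rule".
      - name: pii-rule-cap
        scope: rule
        ruleName: pii-local-only
        maxTokens: 500000
        maxUSD: "50.00"
        windowSeconds: 3600
```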

classification object

Classification configures how the router determines the data classification of an inbound request.

headerKey string

HeaderKey is the request header carrying the classification. Defaults to "x-llmkube-classification".

mode string

Mode selects how the router derives a request's classification. "header-only" (default) trusts the request header (HeaderKey, defaults to x-llmkube-classification). "detector" runs the bundled in-proxy detector. "hybrid" prefers the header, falling back to the detector when no header is present.

enum: header-only, detector, hybrid

sensitiveClassifications []string

SensitiveClassifications are the classification values that trigger fail-closed validation: any rule matching one of these values must have FailClosed=true and reference only local-tier backends. Defaults to ["pii", "phi"].
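For instance, a hybrid classification setup that keeps the documented defaults for the header and the sensitive values could be sketched as:

```yaml
spec:
  policy:
    classification:
      mode: hybrid                                # header first, detector when the header is absent
      headerKey: x-llmkube-classification         # documented default
      sensitiveClassifications: ["pii", "phi"]    # documented default
```

With these values, any rule whose match.dataClassification includes "pii" or "phi" must set failClosed: true and list only local-tier backends.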

proxy object

Proxy configures the managed router-proxy Deployment (replicas, image override for air-gapped sites, resources). Sensible defaults apply when omitted.

image string

Image overrides the default router-proxy container image. Useful for air-gapped clusters that pin to an internal registry digest.

imagePullSecrets []object

ImagePullSecrets are passed through to the router-proxy pod spec.

name string

quarantineDuration string

QuarantineDuration controls how long the proxy keeps a backend in the "skip" state after a 5xx or network error before allowing a half-open probe. Default 15s when unset. Shorter windows make the proxy recover faster from transient blips; longer windows reduce probe load on genuinely-down upstreams. Tests can shrink this to sub-second to verify recovery without sleeping the full default.

replicas integer

Replicas is the desired number of router-proxy pods. Defaults to 1. The proxy is stateless for routing decisions; budget and SLO counters live in memory and reset on pod restart until the persistence feature lands.

format: int32

minimum: 1

maximum: 10

resources object

Resources sets the pod's compute resource requests and limits.

claims []object

Claims lists the names of resources, defined in spec.resourceClaims, that are used by this container. This field depends on the DynamicResourceAllocation feature gate. This field is immutable. It can only be set for containers.

name string required

Name must match the name of one entry in pod.spec.resourceClaims of the Pod where this field is used. It makes that resource available inside a container.

request string

Request is the name chosen for a request in the referenced claim. If empty, everything from the claim is made available, otherwise only the result of this request.

limits object

Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

requests object

Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

responseHeaderTimeout string

ResponseHeaderTimeout caps how long the proxy waits for the upstream to begin sending response headers. For non-streaming chat completions this is effectively a max-generation-time cap; for streaming dispatches the first SSE chunk arrives well inside the window so the cap is invisible. Default 120s when unset. Per-rule and per-backend timeouts (see RouterRule.Timeout and RouterBackend.Timeout) tighten this on a per-request basis but cannot extend it beyond this cap.

revisionHistoryLimit integer

RevisionHistoryLimit caps how many old ReplicaSets the proxy Deployment keeps for rollback. Unset uses the Kubernetes default (10); 0 keeps none. Useful to bound ReplicaSet buildup, since the proxy re-rolls on every config change.

format: int32

minimum: 0
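A sketch of a proxy block for an air-gapped cluster follows. The image reference, pull-secret name, and resource quantities are hypothetical, and the duration strings assume the same format as the documented defaults (15s, 120s).

```yaml
spec:
  proxy:
    image: registry.internal.example.com/llmkube/router-proxy:pinned   # hypothetical image
    imagePullSecrets:
      - name: internal-registry      # hypothetical Secret name
    replicas: 2                      # allowed range is 1-10
    quarantineDuration: 15s          # documented default
    responseHeaderTimeout: 120s      # documented default
    revisionHistoryLimit: 3          # keep three old ReplicaSets for rollback
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```

Budget and SLO counters are held in memory per pod, so they reset whenever a replica restarts.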

rules []object

Rules are evaluated in declaration order. The first matching rule wins. If no rule matches, the fallback is governed by DefaultRouteStrategy, ending at DefaultRoute. If nothing resolves, the request is rejected with HTTP 503.

failClosed boolean

FailClosed: when true, if no backend in Route.Backends is healthy or otherwise eligible, the router rejects the request with HTTP 503 rather than falling through to DefaultRoute or subsequent rules. This is the regulated-data gate: a fail-closed rule guarantees that matched requests are never served by any other route.

match object

Match groups all match conditions. All declared conditions must be true for the rule to fire (AND semantics). If Match is omitted the rule always matches (useful as a catch-all before DefaultRoute).

dataClassification []string

DataClassification matches if the inbound request carries any of these classifications. The classification source depends on Policy.Classification.Mode: a request header (x-llmkube-classification by default), the bundled detector, or both. Common values: "public", "internal", "confidential", "pii", "phi".

headers object

Headers performs exact-match equality on inbound HTTP headers (case-insensitive header name comparison).

latencySLOMs integer

LatencySLOMs is a P95 first-token-latency target in milliseconds. When set, if the rolling P95 for the primary backend exceeds this value the rule promotes its declared fallback. Honored only by the "primary-fallback" strategy.

format: int32

minimum: 1

models []string

Models matches against the OpenAI-style "model" field in the request body. Glob patterns are supported (e.g. "qwen3-*").

requiredCapabilities []string

RequiredCapabilities filters backends. The rule only matches if at least one backend in Route.Backends advertises every listed capability.

taskComplexity string

TaskComplexity matches the inbound complexity hint (header x-llmkube-task-complexity).

enum: simple, moderate, complex

name string required

Name is used in audit logs and metrics labels.

pattern: ^[a-z0-9][a-z0-9-]{0,62}$

route object required

Route is the action taken when this rule matches.

backends []string required

Backends is an ordered list of RouterBackend.Name values. For the "primary-fallback" strategy, the first entry is the primary and subsequent entries are tried in order on failure. For "weighted", traffic is distributed across all entries by Backend.Weight. For "shadow", the first entry serves the response and subsequent entries receive mirrored requests for evaluation only.

minItems: 1

strategy string

Strategy selects how multiple backends are used.

enum: primary-fallback, weighted, shadow

timeout string

Timeout caps how long the proxy waits for the upstream to begin sending response headers on dispatches matched by this rule. When set it overrides RouterBackend.Timeout and the proxy default. Resolution order at dispatch time: rule.timeout || backend.timeout || proxy default. Useful for tightening regulated-data rules (sub-10s strict fail-fast) or extending long-reasoning rules (120s+).
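The sketch below shows two rules evaluated in declaration order: a fail-closed rule for sensitive data, then a capability-aware rule with a fallback. The rule names, the backend names "qwen-local" and "claude-opus-4" (assumed to be declared under spec.backends, with "qwen-local" in the local tier), the SLO value, and the "10s" duration format are assumptions for illustration.

```yaml
spec:
  rules:
    # Evaluated first: sensitive requests stay on local-tier backends,
    # and are rejected with HTTP 503 if none is eligible.
    - name: pii-local-only
      failClosed: true
      match:
        dataClassification: ["pii", "phi"]
      route:
        backends: ["qwen-local"]
      timeout: 10s                 # rule-level timeout takes precedence over backend.timeout
    # All match conditions must hold (AND semantics).
    - name: complex-tools
      match:
        models: ["qwen3-*"]        # glob on the request's "model" field
        taskComplexity: complex    # from header x-llmkube-task-complexity
        requiredCapabilities: ["tools"]
        latencySLOMs: 800          # hypothetical P95 first-token target in ms
      route:
        strategy: primary-fallback
        backends: ["qwen-local", "claude-opus-4"]   # first entry is the primary
```

Placing the fail-closed rule first matters because the first matching rule wins; a broader rule declared above it would capture the sensitive traffic instead.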

status object

status defines the observed state of ModelRouter.

activeRules integer

ActiveRules is the count of rules that successfully validated against current backend state.

format: int32

backends []object

Backends reports the resolved address and current health of every declared backend.

address string

Address is the resolved upstream URL the router-proxy dispatches to. For local backends this is the InferenceService's cluster URL; for external backends it is the provider's base URL.

healthy boolean

Healthy reflects the most recent probe result.

lastProbeTime string

LastProbeTime is when the proxy last completed a health probe for this backend.

format: date-time

message string

Message provides extra context, especially when Healthy is false (e.g. "InferenceService not Ready", "Secret missing key ANTHROPIC_API_KEY").

name string required

Name matches RouterBackend.Name.

tier string

Tier mirrors RouterBackend.Tier for convenience.

budgetUtilization []object

BudgetUtilization summarises current budget consumption.

name string required

Name matches BudgetSpec.Name.

tokensUsed integer

TokensUsed is the rolling-window token count.

format: int64

usdUsed string

USDUsed is the rolling-window estimated cost in USD.

utilization string

Utilization is the fraction of the budget consumed, 0.0 to 1.0. When both MaxTokens and MaxUSD are set this is the maximum of the two utilizations.

conditions []object

conditions represent the current state of the ModelRouter resource. Standard condition types:

- "Validated": the spec passed static validation
- "BackendsReady": all referenced backends are reachable and healthy
- "Available": the router-proxy is serving traffic
- "Degraded": at least one backend is unhealthy but the router can still serve other routes
- "GatewayReady": (dataPlane: Gateway) the gateway resources reconciled

lastTransitionTime string required

lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.

format: date-time

message string required

message is a human readable message indicating details about the transition. This may be an empty string.

maxLength: 32768

observedGeneration integer

observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.

format: int64

minimum: 0

reason string required

reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.

pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$

minLength: 1

maxLength: 1024

status string required

status of the condition, one of True, False, Unknown.

enum: True, False, Unknown

type string required

type of condition in CamelCase or in foo.example.com/CamelCase.

pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$

maxLength: 316

endpoint string

Endpoint is the in-cluster URL clients should hit. Populated once the router-proxy Service is ready.

gateway object

Gateway reports the observed state of dataPlane: Gateway exposure: whether the AIGatewayRoute (and its backing Backend / AIServiceBackend / BackendTrafficPolicy) reconciled, and the resolved gateway endpoint. nil in Proxy mode. Also surfaced via the GatewayReady condition.

authEnabled boolean

AuthEnabled indicates a SecurityPolicy enforcing JWT authentication was compiled for this route (ModelRouter policy.auth.jwt). Set by the ModelRouter dataPlane: Gateway path; false when no auth is configured.

endpoint string

Endpoint is the gateway address clients send OpenAI requests to. Set by the ModelRouter dataPlane: Gateway path (resolved from the referenced Gateway); empty for the InferenceService path.

modelName string

ModelName is the resolved model-name match value clients send as the OpenAI "model" string to reach this InferenceService through the gateway. Set by the InferenceService path; empty for ModelRouter (which fronts many model names).

routeReady boolean

RouteReady indicates the AIGatewayRoute (and its backing Backend + AIServiceBackend) were reconciled successfully against the gateway.

lastUpdated string

LastUpdated is the timestamp of the last status reconciliation.

format: date-time

phase string

Phase is a coarse summary of the router's state. Possible values: Pending, Provisioning, Ready, Degraded, Failed.

enum: Pending, Provisioning, Ready, Degraded, Failed
