
InferenceService

inference.llmkube.dev / v1alpha1

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
apiVersion string
APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind string
Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata object
spec object required
spec defines the desired state of InferenceService
args []string
Args overrides the container arguments entirely. Only used when Runtime is "generic". For llamacpp, use ExtraArgs instead.
autoscaling object
Autoscaling configures horizontal pod autoscaling for the inference service. When set, the controller creates and manages an HPA resource targeting the inference Deployment. Requires Prometheus Adapter for custom metrics. Mutually exclusive with manual replica management: when autoscaling is enabled, the Replicas field serves as the initial replica count only.
maxReplicas integer required
MaxReplicas is the upper limit for the number of replicas.
format: int32
minimum: 1
maximum: 100
metrics []object
Metrics defines the scaling metrics and target values. If empty, defaults to llamacpp:requests_processing with target average value of 2.
name string required
Name is the metric name (e.g., llamacpp:requests_processing).
targetAverageUtilization integer
TargetAverageUtilization is the target utilization percentage for Resource-type metrics.
format: int32
targetAverageValue string
TargetAverageValue is the target per-pod average for Pods-type metrics.
type string required
Type is the metric source type.
enum: Pods, Resource
minReplicas integer
MinReplicas is the lower limit for the number of replicas.
format: int32
minimum: 1
maximum: 10
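A minimal sketch of an autoscaling block, assuming a Model CR named example-model (hypothetical) and a cluster where Prometheus Adapter already exposes the metric. The metric name and target mirror the documented default; the replica bounds are illustrative only.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model          # hypothetical Model CR name
  autoscaling:
    minReplicas: 1                 # schema range: 1 to 10
    maxReplicas: 4                 # schema range: 1 to 100
    metrics:
      - type: Pods                 # one of: Pods, Resource
        name: llamacpp:requests_processing
        targetAverageValue: "2"    # a string in the schema, so it is quoted
```

For a Resource-type metric, targetAverageUtilization (an integer percentage) would be set instead of targetAverageValue.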
batchSize integer
BatchSize sets the token batch size for prompt processing. Larger values improve throughput but use more memory. Maps to llama.cpp --batch-size flag.
format: int32
minimum: 1
maximum: 16384
cacheTypeCustomK string
CacheTypeCustomK sets a custom KV cache type for keys that is not in the standard enum. Used for llama.cpp forks with additional cache formats such as TurboQuant (turbo3, turbo4, tbqp3, etc.). Maps to llama.cpp --cache-type-k. The runtime binary must understand the value or llama-server will fail to start; LLMKube does not validate the string. Takes precedence over CacheTypeK when both are set.
cacheTypeCustomV string
CacheTypeCustomV sets a custom KV cache type for values that is not in the standard enum. See CacheTypeCustomK for usage notes. Takes precedence over CacheTypeV when both are set.
cacheTypeK string
CacheTypeK sets the KV cache quantization type for keys. Supported values depend on the llama.cpp build version. Maps to llama.cpp --cache-type-k flag. Default: f16 (llama.cpp default). For custom build types not in the enum (e.g. TurboQuant turbo3, tbqp3), use CacheTypeCustomK instead.
enum: f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, iq4_nl
cacheTypeV string
CacheTypeV sets the KV cache quantization type for values. Maps to llama.cpp --cache-type-v flag. Default: f16 (llama.cpp default). For custom build types not in the enum (e.g. TurboQuant turbo3, tbqp3), use CacheTypeCustomV instead.
enum: f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, iq4_nl
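As an illustration of the KV cache fields, the sketch below quantizes both keys and values to q8_0 using the standard enum. The Model CR name is hypothetical, and whether q8_0 is accepted depends on the llama.cpp build in the image.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model    # hypothetical Model CR name
  cacheTypeK: q8_0           # maps to --cache-type-k
  cacheTypeV: q8_0           # maps to --cache-type-v
  # For a fork with extra cache formats, the custom field takes precedence
  # over the enum field and is not validated by LLMKube:
  # cacheTypeCustomK: turbo3
  # cacheTypeCustomV: turbo3
```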
command []string
Command overrides the container entrypoint. Only used when Runtime is "generic" or for advanced customization.
containerPort integer
ContainerPort overrides the primary container port. Each runtime has its own default (llamacpp: 8080).
format: int32
minimum: 1
maximum: 65535
contextSize integer
ContextSize sets the context window size for the llama.cpp server (-c flag). Larger values allow processing longer inputs but require more memory. If not specified, llama.cpp uses its default (typically 512 or 2048). The upper bound covers Qwen 3.6 at 1M-via-YaRN with margin and accommodates near-future hybrid-attention model architectures. KV cache memory is the user's responsibility to size via spec.resources.memory or hostMemory.
format: int32
minimum: 128
maximum: 2097152
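A short sketch of how contextSize and batchSize might be set together. The values are illustrative, the Model CR name is hypothetical, and memory sizing for the resulting KV cache is left to the user, as noted above.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # hypothetical Model CR name
  contextSize: 32768        # passed to llama.cpp as -c; schema range 128 to 2097152
  batchSize: 2048           # passed as --batch-size; schema range 1 to 16384
```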
endpoint object
Endpoint defines the service endpoint configuration
path string
Path is the HTTP path for the inference endpoint
port integer
Port is the service port
format: int32
minimum: 1
maximum: 65535
type string
Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer)
enum: ClusterIP, NodePort, LoadBalancer
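A possible endpoint block, shown as a sketch. The port and path values are illustrative assumptions rather than documented defaults, and the Model CR name is hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model         # hypothetical Model CR name
  endpoint:
    type: ClusterIP               # one of: ClusterIP, NodePort, LoadBalancer
    port: 8080                    # illustrative service port
    path: /v1/chat/completions    # illustrative path
```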
env []object
Env adds environment variables to the inference container. Useful for HF_TOKEN, custom runtime config, etc.
name string required
Name of the environment variable. May consist of any printable ASCII characters except '='.
value string
Variable references $(VAR_NAME) are expanded using the previously defined environment variables in the container and any service environment variables. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Defaults to "".
valueFrom object
Source for the environment variable's value. Cannot be used if value is not empty.
configMapKeyRef object
Selects a key of a ConfigMap.
key string required
The key to select.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional boolean
Specify whether the ConfigMap or its key must be defined
fieldRef object
Selects a field of the pod: supports metadata.name, metadata.namespace, `metadata.labels['<KEY>']`, `metadata.annotations['<KEY>']`, spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.
apiVersion string
Version of the schema the FieldPath is written in terms of, defaults to "v1".
fieldPath string required
Path of the field to select in the specified API version.
fileKeyRef object
FileKeyRef selects a key of the env file. Requires the EnvFiles feature gate to be enabled.
key string required
The key within the env file. An invalid key will prevent the pod from starting. The keys defined within a source may consist of any printable ASCII characters except '='. During Alpha stage of the EnvFiles feature gate, the key size is limited to 128 characters.
optional boolean
Specify whether the file or its key must be defined. If the file or key does not exist, then the env var is not published. If optional is set to true and the specified key does not exist, the environment variable will not be set in the Pod's containers. If optional is set to false and the specified key does not exist, an error will be returned during Pod creation.
path string required
The path within the volume from which to select the file. Must be relative and may not contain the '..' path or start with '..'.
volumeName string required
The name of the volume mount containing the env file.
resourceFieldRef object
Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.
containerName string
Container name: required for volumes, optional for env vars
divisor string | integer
Specifies the output format of the exposed resources, defaults to "1"
string pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
resource string required
Required: resource to select
secretKeyRef object
Selects a key of a secret in the pod's namespace
key string required
The key of the secret to select from. Must be a valid secret key.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional boolean
Specify whether the Secret or its key must be defined
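The env entries follow the standard Kubernetes EnvVar shape, so a sketch combining a literal value, a Secret reference, and a pod field reference could look like this. The Secret name hf-credentials, the LOG_LEVEL variable, and the Model CR name are hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model        # hypothetical Model CR name
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-credentials   # hypothetical Secret in the pod's namespace
          key: HF_TOKEN
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: LOG_LEVEL            # hypothetical variable
      value: "info"              # value and valueFrom cannot be combined on one entry
```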
evictionProtection boolean
EvictionProtection marks this service as ineligible for memory-pressure eviction by the metal-agent watchdog. Use this for production workloads that should never be silently stopped under memory pressure, even when they are the lowest-priority option. The agent's per-process pickEvictionTarget excludes protected processes from the eviction-candidate set; the MemoryPressure status condition is still patched on protected services for operator visibility. Has no effect when --eviction-enabled is unset on the metal-agent or for non-llama-server runtimes (oMLX, Ollama). Defaults to false.
extraArgs []string
ExtraArgs provides additional command-line arguments passed directly to the runtime process. Use for flags not yet supported as typed CRD fields. Arguments are appended after all other configured flags. Supported by the "llamacpp" and "vllm" runtimes. Ignored by others. Example: ["--seed", "42", "--log-disable"]
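Written out as a manifest, the example list above might look like the following sketch (the Model CR name is hypothetical). Each flag and each flag value is its own list element, since the arguments are passed to the process directly rather than through a shell.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # hypothetical Model CR name
  extraArgs:                # appended after all other configured flags
    - "--seed"
    - "42"
    - "--log-disable"
```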
flashAttention boolean
FlashAttention enables flash attention for faster prompt processing and reduced KV cache memory. Maps to llama.cpp --flash-attn flag. On NVIDIA GPUs requires Ampere or newer (compute capability 8.0+). On Apple Silicon (Metal agent path) the default is true when this field is unset, because the wired-collector + flash-attn combination prevents the ~25% decode degradation observed at long context on Qwen-class models running on M-series chips.
image string
Image is the container image for the inference runtime. For llamacpp runtime, defaults to ghcr.io/ggml-org/llama.cpp:server. For generic runtime, this field is required.
imagePullSecrets []object
ImagePullSecrets for pulling container images from private registries.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
jinja boolean
Jinja enables Jinja2 chat template rendering for tool/function calling support. Required when using the OpenAI-compatible API with tools. Maps to llama.cpp --jinja flag.
metadataOverrides []string
MetadataOverrides overrides GGUF metadata key-value pairs at model load time. Each entry is passed as a separate --override-kv flag. Format: key=type:value (e.g., "qwen35moe.context_length=int:1048576" to extend context window, or "tokenizer.chat_template.thinking=bool:false" to tweak tokenizer behavior). Maps to llama.cpp --override-kv flag (one flag per entry).
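A sketch using the two override strings quoted in the description. The Model CR name is hypothetical, and the keys only make sense for a GGUF file that actually carries them.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # hypothetical Model CR name
  metadataOverrides:
    # Each entry becomes its own --override-kv flag, in key=type:value form.
    - "qwen35moe.context_length=int:1048576"
    - "tokenizer.chat_template.thinking=bool:false"
```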
modelRef string required
ModelRef references the Model CR that contains the model to serve
moeCPULayers integer
MoeCPULayers sets the number of MoE layers to offload to CPU. When set, only the specified number of MoE layers run on CPU rather than all. Maps to llama.cpp --n-cpu-moe flag.
format: int32
minimum: 0
moeCPUOffload boolean
MoeCPUOffload offloads all MoE expert layers to CPU for reduced VRAM usage. Enables running large MoE models (e.g., Qwen3-30B, Mixtral) on VRAM-constrained hardware by keeping attention layers on GPU while expert weights use system RAM. Maps to llama.cpp --cpu-moe flag. Requires sufficient system RAM via resources.memory.
noKvOffload boolean
NoKvOffload keeps the KV cache in system RAM instead of VRAM. Useful for extended context windows when VRAM is constrained by model weights. Maps to llama.cpp --no-kv-offload flag. Requires sufficient system RAM via resources.memory.
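To illustrate the VRAM-saving options, the sketch below offloads all MoE expert layers to CPU and keeps the KV cache in system RAM. The Model CR name is hypothetical, and both settings assume enough system RAM has been requested for the pod.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # hypothetical Model CR name
  moeCPUOffload: true       # maps to --cpu-moe (all expert layers on CPU)
  # Alternative: offload only a fixed number of MoE layers (maps to --n-cpu-moe).
  # moeCPULayers: 20        # illustrative value
  noKvOffload: true         # maps to --no-kv-offload
```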
noWarmup boolean
NoWarmup skips the llama.cpp startup warmup inference pass. Reduces pod ready time at the cost of slightly higher first-request latency. Useful for scale-to-zero and quick redeployment patterns. Maps to llama.cpp --no-warmup flag.
nodeSelector object
NodeSelector for pod placement (e.g., specific node pools)
parallelSlots integer
ParallelSlots sets the number of concurrent request slots for the llama.cpp server (--parallel flag). Each slot processes one request independently; higher values use more KV cache memory. If not specified, the operator omits --parallel and llama.cpp picks an auto value (currently 4).
format: int32
minimum: 1
maximum: 64
personaPlexConfig object
PersonaPlexConfig holds configuration for the PersonaPlex (Moshi) runtime. Only used when Runtime is "personaplex".
cpuOffload boolean
CPUOffload enables model weight offloading to system RAM when GPU VRAM is insufficient. Requires the accelerate package in the container image.
hfTokenSecretRef object
HFTokenSecretRef references a Secret containing the HuggingFace token for model download. The Secret key must be "HF_TOKEN".
key string required
The key of the secret to select from. Must be a valid secret key.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional boolean
Specify whether the Secret or its key must be defined
quantize4Bit boolean
Quantize4Bit enables NF4 4-bit quantization for reduced VRAM usage (~9.6 GB vs ~14 GB). Requires the bitsandbytes package in the container image.
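A sketch of a personaPlexConfig block. The field that selects the "personaplex" runtime is not described in this part of the reference, so it is left out here and must be set separately. The Secret name and Model CR name are hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model      # hypothetical Model CR name
  # Runtime selection ("personaplex") is configured by a field not covered
  # in this section and is intentionally omitted.
  personaPlexConfig:
    quantize4Bit: true         # needs bitsandbytes in the container image
    cpuOffload: false          # true would need accelerate in the container image
    hfTokenSecretRef:
      name: hf-credentials     # hypothetical Secret
      key: HF_TOKEN            # the description requires this key name
```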
podAnnotations object
PodAnnotations are merged into the inference Pod's metadata.annotations. Use this to tag Pods for downstream tooling (cost attribution, service mesh routing, custom admission controllers) without those tools needing to know about LLMKube's CRD schema. Pure passthrough; the operator itself does not set any annotations on inference Pods today.
podLabels object
PodLabels are merged into the inference Pod's metadata.labels alongside the operator-managed labels (`app`, `inference.llmkube.dev/model`, `inference.llmkube.dev/service`). Operator-managed keys take precedence on collision so the Deployment selector stays in sync with the Pods it owns. The Deployment selector itself uses only the operator-managed labels and is immutable, so changing PodLabels later is safe.
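A small sketch of both passthrough maps. All keys and values are hypothetical; a label that collided with an operator-managed key such as app would be overridden by the operator's value.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model                     # hypothetical Model CR name
  podAnnotations:
    example.com/cost-center: "ml-platform"    # hypothetical annotation
  podLabels:
    team: inference                           # hypothetical label
```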
podSecurityContext object
PodSecurityContext defines pod-level security attributes for inference pods. Use this to set fsGroup for volume permissions (required on OpenShift).
appArmorProfile object
appArmorProfile is the AppArmor options to use by the containers in this pod. Note that this field cannot be set when spec.os.name is windows.
localhostProfile string
localhostProfile indicates a profile loaded on the node that should be used. The profile must be preconfigured on the node to work. Must match the loaded name of the profile. Must be set if and only if type is "Localhost".
type string required
type indicates which kind of AppArmor profile will be applied. Valid options are:
Localhost - a profile pre-loaded on the node.
RuntimeDefault - the container runtime's default profile.
Unconfined - no AppArmor enforcement.
fsGroup integer
A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:
1. The owning GID will be the FSGroup
2. The setgid bit is set (new files created in the volume will be owned by FSGroup)
3. The permission bits are OR'd with rw-rw----
If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.
format: int64
fsGroupChangePolicy string
fsGroupChangePolicy defines the behavior of changing ownership and permission of the volume before it is exposed inside the Pod. This field only applies to volume types that support fsGroup-based ownership (and permissions). It has no effect on ephemeral volume types such as secret, configmap, and emptydir. Valid values are "OnRootMismatch" and "Always". If not specified, "Always" is used. Note that this field cannot be set when spec.os.name is windows.
runAsGroup integer
The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.
format: int64
runAsNonRoot boolean
Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
runAsUser integer
The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.
format: int64
seLinuxChangePolicy string
seLinuxChangePolicy defines how the container's SELinux label is applied to all volumes used by the Pod. It has no effect on nodes that do not support SELinux or on volumes that do not support SELinux. Valid values are "MountOption" and "Recursive". "Recursive" means relabeling of all files on all Pod volumes by the container runtime. This may be slow for large volumes, but allows mixing privileged and unprivileged Pods sharing the same volume on the same node. "MountOption" mounts all eligible Pod volumes with the `-o context` mount option. This requires all Pods that share the same volume to use the same SELinux label. It is not possible to share the same volume among privileged and unprivileged Pods. Eligible volumes are in-tree FibreChannel and iSCSI volumes, and all CSI volumes whose CSI driver announces SELinux support by setting spec.seLinuxMount: true in their CSIDriver instance. Other volumes are always re-labelled recursively. The "MountOption" value is allowed only when the SELinuxMount feature gate is enabled. If not specified and the SELinuxMount feature gate is enabled, "MountOption" is used. If not specified and the SELinuxMount feature gate is disabled, "MountOption" is used for ReadWriteOncePod volumes and "Recursive" for all other volumes. This field affects only Pods that have an SELinux label set, either in PodSecurityContext or in the SecurityContext of all containers. All Pods that use the same volume should use the same seLinuxChangePolicy, otherwise some pods can get stuck in the ContainerCreating state. Note that this field cannot be set when spec.os.name is windows.
seLinuxOptions object
The SELinux context to be applied to all containers. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.
level string
Level is SELinux level label that applies to the container.
role string
Role is a SELinux role label that applies to the container.
type string
Type is a SELinux type label that applies to the container.
user string
User is a SELinux user label that applies to the container.
seccompProfile object
The seccomp options to use by the containers in this pod. Note that this field cannot be set when spec.os.name is windows.
localhostProfile string
localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must be set if type is "Localhost". Must NOT be set for any other type.
type string required
type indicates which kind of seccomp profile will be applied. Valid options are:
Localhost - a profile defined in a file on the node should be used.
RuntimeDefault - the container runtime default profile should be used.
Unconfined - no profile should be applied.
supplementalGroups []integer
A list of groups applied to the first process run in each container, in addition to the container's primary GID and fsGroup (if specified). If the SupplementalGroupsPolicy feature is enabled, the supplementalGroupsPolicy field determines whether these are in addition to or instead of any group memberships defined in the container image. If unspecified, no additional groups are added, though group memberships defined in the container image may still be used, depending on the supplementalGroupsPolicy field. Note that this field cannot be set when spec.os.name is windows.
supplementalGroupsPolicy string
Defines how supplemental groups of the first container processes are calculated. Valid values are "Merge" and "Strict". If not specified, "Merge" is used. (Alpha) Using the field requires the SupplementalGroupsPolicy feature gate to be enabled and the container runtime must implement support for this feature. Note that this field cannot be set when spec.os.name is windows.
sysctls []object
Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported sysctls (by the container runtime) might fail to launch. Note that this field cannot be set when spec.os.name is windows.
name string required
Name of a property to set
value string required
Value of a property to set
windowsOptions object
The Windows specific settings applied to all containers. If unspecified, the options within a container's SecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.
gmsaCredentialSpec string
GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.
gmsaCredentialSpecName string
GMSACredentialSpecName is the name of the GMSA credential spec to use.
hostProcess boolean
HostProcess determines if a container should be run as a 'Host Process' container. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.
runAsUserName string
The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
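A sketch of a Linux podSecurityContext that sets fsGroup for volume permissions. The numeric IDs are illustrative assumptions; on platforms that assign UID and GID ranges per namespace (OpenShift, for example), fixed values like these may need to be adjusted or omitted. The Model CR name is hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model                 # hypothetical Model CR name
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 1000                       # illustrative UID
    runAsGroup: 1000                      # illustrative GID
    fsGroup: 1000                         # group applied to supported volumes
    fsGroupChangePolicy: OnRootMismatch   # skip the recursive chown when the volume root already matches
    seccompProfile:
      type: RuntimeDefault
```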
priority string
Priority determines scheduling priority for GPU allocation. Higher priority services can preempt lower priority ones when GPUs are scarce.
enum: critical, high, normal, low, batch
priorityClassName string
PriorityClassName allows specifying a custom Kubernetes PriorityClass. Takes precedence over the Priority field if set.
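The two priority fields could be used as in the sketch below. The PriorityClass name gpu-critical and the Model CR name are hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # hypothetical Model CR name
  priority: high            # one of: critical, high, normal, low, batch
  # A custom PriorityClass takes precedence over priority when set:
  # priorityClassName: gpu-critical   # hypothetical PriorityClass
```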
probeOverrides object
ProbeOverrides allows replacing the auto-generated health probes. Useful for runtimes with non-HTTP health endpoints (e.g., TCP, WebSocket).
liveness object
Liveness overrides the liveness probe.
exec object
Exec specifies a command to execute in the container.
command []string
Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
failureThreshold integer
Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.
format: int32
grpc object
GRPC specifies a GRPC HealthCheckRequest.
port integer required
Port number of the gRPC service. Number must be in the range 1 to 65535.
format: int32
service string
Service is the name of the service to place in the gRPC HealthCheckRequest (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md). If this is not specified, the default behavior is defined by gRPC.
httpGet object
HTTPGet specifies an HTTP GET request to perform.
host string
Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
httpHeaders []object
Custom headers to set in the request. HTTP allows repeated headers.
name string required
The header field name. This will be canonicalized upon output, so case-variant names will be understood as the same header.
value string required
The header field value
path string
Path to access on the HTTP server.
port string | integer required
Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
scheme string
Scheme to use for connecting to the host. Defaults to HTTP.
initialDelaySeconds integer
Number of seconds after the container has started before liveness probes are initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format: int32
periodSeconds integer
How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
format: int32
successThreshold integer
Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
format: int32
tcpSocket object
TCPSocket specifies a connection to a TCP port.
host string
Optional: Host name to connect to, defaults to the pod IP.
port string | integer required
Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
terminationGracePeriodSeconds integer
Optional duration in seconds the pod needs to terminate gracefully upon probe failure. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this value overrides the value provided by the pod spec. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). This is a beta field and requires enabling ProbeTerminationGracePeriod feature gate. Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
format: int64
timeoutSeconds integer
Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format: int32
readiness object
Readiness overrides the readiness probe.
exec object
Exec specifies a command to execute in the container.
command []string
Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
failureThreshold integer
Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.
format: int32
grpc object
GRPC specifies a GRPC HealthCheckRequest.
port integer required
Port number of the gRPC service. Number must be in the range 1 to 65535.
format: int32
service string
Service is the name of the service to place in the gRPC HealthCheckRequest (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md). If this is not specified, the default behavior is defined by gRPC.
httpGet object
HTTPGet specifies an HTTP GET request to perform.
host string
Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
httpHeaders []object
Custom headers to set in the request. HTTP allows repeated headers.
name string required
The header field name. This will be canonicalized upon output, so case-variant names will be understood as the same header.
value string required
The header field value
path string
Path to access on the HTTP server.
port string | integer required
Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
scheme string
Scheme to use for connecting to the host. Defaults to HTTP.
initialDelaySeconds integer
Number of seconds after the container has started before the probe is initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format: int32
periodSeconds integer
How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
format: int32
successThreshold integer
Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
format: int32
tcpSocket object
TCPSocket specifies a connection to a TCP port.
host string
Optional: Host name to connect to, defaults to the pod IP.
port string | integer required
Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
terminationGracePeriodSeconds integer
Optional duration in seconds the pod needs to terminate gracefully upon probe failure. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this value overrides the value provided by the pod spec. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). This is a beta field and requires enabling ProbeTerminationGracePeriod feature gate. Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
format: int64
timeoutSeconds integer
Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format: int32
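As a sketch of overriding the liveness and readiness probes, the example below uses a TCP check for liveness and an HTTP check for readiness. The port, the /health path, and the timing values are illustrative assumptions, and the Model CR name is hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model     # hypothetical Model CR name
  probeOverrides:
    liveness:
      tcpSocket:
        port: 8080            # illustrative container port
      periodSeconds: 10
      failureThreshold: 3
    readiness:
      httpGet:
        path: /health         # illustrative path
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 2
```

Each probe takes one handler (exec, grpc, httpGet, or tcpSocket), matching the standard Kubernetes Probe type.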
startup object
Startup overrides the startup probe.
exec object
Exec specifies a command to execute in the container.
command []string
Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
failureThreshold integer
Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.
format: int32
grpc object
GRPC specifies a GRPC HealthCheckRequest.
port integer required
Port number of the gRPC service. Number must be in the range 1 to 65535.
format: int32
service string
Service is the name of the service to place in the gRPC HealthCheckRequest (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md). If this is not specified, the default behavior is defined by gRPC.
httpGet object
HTTPGet specifies an HTTP GET request to perform.
host string
Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.
httpHeaders []object
Custom headers to set in the request. HTTP allows repeated headers.
name string required
The header field name. This will be canonicalized upon output, so case-variant names will be understood as the same header.
value string required
The header field value.
path string
Path to access on the HTTP server.
port string | integer required
Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
scheme string
Scheme to use for connecting to the host. Defaults to HTTP.
initialDelaySeconds integer
Number of seconds after the container has started before liveness probes are initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format: int32
periodSeconds integer
How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
format: int32
successThreshold integer
Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
format: int32
tcpSocket object
TCPSocket specifies a connection to a TCP port.
host string
Optional: Host name to connect to, defaults to the pod IP.
port string | integer required
Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.
terminationGracePeriodSeconds integer
Optional duration in seconds the pod needs to terminate gracefully upon probe failure. The grace period is the duration in seconds between the moment the processes running in the pod are sent a termination signal and the moment they are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this value overrides the value provided by the pod spec. Value must be a non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). This is a beta field and requires enabling the ProbeTerminationGracePeriod feature gate. Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
format: int64
timeoutSeconds integer
Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format: int32
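As a rough illustration of the startup probe fields above, the fragment below sketches an HTTP startup probe that gives a large model time to load. The `/health` path and port 8080 are assumptions about the serving container, and the fragment is meant to sit under whichever parent field holds the probe overrides. With these values the container has up to failureThreshold x periodSeconds = 600 seconds to pass its first check.

```yaml
# Sketch only: nest this under the parent field that holds the probe overrides.
startup:
  httpGet:
    path: /health        # assumed health endpoint of the inference server
    port: 8080           # assumed container port
  periodSeconds: 10      # probe every 10 seconds
  failureThreshold: 60   # 60 failures x 10s = up to 600s allowed for model load
  timeoutSeconds: 5      # each probe attempt may take up to 5 seconds
  successThreshold: 1    # must be 1 for startup probes
```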
reasoningBudget integer
ReasoningBudget caps the number of reasoning tokens the model is allowed to emit per response. Zero disables visible thinking output entirely; the model still reasons internally but does not emit thinking tokens. Critical for production agentic workloads on thinking models (Qwen 3.6, GLM-5) where runaway reasoning can burn compute. Maps to llama.cpp --reasoning-budget flag.
format: int32
minimum: 0
reasoningBudgetMessage string
ReasoningBudgetMessage is injected when the reasoning budget is exhausted, forcing the model to conclude. Ignored unless ReasoningBudget is also set. Maps to llama.cpp --reasoning-budget-message flag.
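A minimal sketch of how the two reasoning fields might be set together, assuming they sit directly under spec. The budget of 2048 tokens and the message text are illustrative values, not recommendations from the schema.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  # other spec fields omitted for brevity
  reasoningBudget: 2048   # illustrative cap on reasoning tokens per response
  # Illustrative text; ignored unless reasoningBudget is also set.
  reasoningBudgetMessage: "Reasoning budget reached. Give the final answer now."
```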
replicas integer
Replicas is the desired number of inference pods.
format: int32
minimum: 0
maximum: 10
resources object
Resources defines compute resources for inference pods
cpu string
CPU requests (e.g., "2" or "2000m")
gpu integer
GPU count required per pod. For multi-GPU inference, each pod gets this many GPUs. Note: multi-GPU sharding config comes from the Model CRD.
format: int32
minimum: 0
maximum: 8
gpuMemory string
GPUMemory specifies the GPU memory limit per pod (e.g., "16Gi"). Used for scheduling and validation.
hostMemory string
HostMemory specifies the system RAM required for hybrid GPU/CPU offloading (e.g., "64Gi"). Used when MoE expert weights or KV cache are offloaded to CPU via moeCPUOffload or noKvOffload. Translated to pod resources.requests.memory, taking precedence over Memory when set. Without this, the K8s scheduler has no visibility into the pod's actual RAM consumption, which can lead to OOM kills after model load.
memory string
Memory requests (e.g., "4Gi")
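The resources fields above could be combined roughly as follows for a hybrid GPU/CPU offload setup. All quantities are illustrative assumptions. Because hostMemory is set, it is the value described above as being translated to the pod's resources.requests.memory, taking precedence over memory.

```yaml
spec:
  # other spec fields omitted for brevity
  resources:
    gpu: 2               # GPUs per pod (schema allows 0 to 8)
    gpuMemory: "24Gi"    # used for scheduling and validation
    hostMemory: "64Gi"   # system RAM for offloaded weights or KV cache
    cpu: "8"             # CPU request
    memory: "16Gi"       # superseded by hostMemory when both are set
```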
ropeScaling object
RopeScaling configures RoPE-based context extension so a model can be served past its native trained context (e.g. 128K served at 256K via YaRN). For the llamacpp runtime this maps to --rope-scaling / --rope-scale / --yarn-orig-ctx. Prefer this over raw spec.extraArgs: it is validated and discoverable via `kubectl explain`. If --rope-scaling is also present in spec.extraArgs, extraArgs wins and this is skipped.
factor string
Factor is the scale multiplier (--rope-scale), e.g. "2.0" to double the native context. A string to avoid CRD float pitfalls; the runtime parses it as a float. Optional.
pattern: ^[0-9]+(\.[0-9]+)?$
originalContext integer
OriginalContext is the model's native training context length (--yarn-orig-ctx), e.g. 131072 for a 128K model. Recommended with yarn.
format: int32
minimum: 128
type string required
Type is the scaling method (--rope-scaling). "yarn" is the usual choice for extending context (e.g. 128K to 256K).
enum: linear, yarn, longrope
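Using the figures mentioned in the descriptions above (a 128K model served at 256K via YaRN), a ropeScaling block might look like the sketch below. Only the field names and example values come from this reference; the rest of the spec is omitted.

```yaml
spec:
  # other spec fields omitted for brevity
  ropeScaling:
    type: yarn                # maps to --rope-scaling
    factor: "2.0"             # string, not float; maps to --rope-scale
    originalContext: 131072   # native 128K context; maps to --yarn-orig-ctx
```

Note that factor is quoted so it is stored as a string and matches the documented pattern.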
runtime string
Runtime selects the inference server backend. "llamacpp" (default): llama.cpp server with auto-generated args and /health probes. "generic": user-provided container with custom command, args, env, and probes. "personaplex": NVIDIA PersonaPlex (Moshi) speech-to-speech server. "vllm": vLLM OpenAI-compatible server with PagedAttention. "tgi": HuggingFace Text Generation Inference server.
enum: llamacpp, personaplex, vllm, tgi, generic
runtimeClassName string
RuntimeClassName selects a Kubernetes RuntimeClass for the inference Pod. Most commonly set to "nvidia" on clusters where the NVIDIA Container Runtime is not configured as the cluster default. Without it, GPU pods schedule onto the GPU node but never get the device files bind-mounted, and the container fails at runtime with "no CUDA-capable device is detected". Maps directly to PodSpec.RuntimeClassName. Most clusters running the NVIDIA GPU Operator with the default toolkit env do not need this set; it is an escape hatch for clusters where the runtime configuration is non-default.
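A short sketch of selecting the backend and the RuntimeClass together. It assumes a RuntimeClass named "nvidia" already exists in the cluster; the field only references it and does not create it.

```yaml
spec:
  # other spec fields omitted for brevity
  runtime: llamacpp          # default backend; shown explicitly for clarity
  runtimeClassName: nvidia   # assumes a RuntimeClass named "nvidia" exists
```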
securityContext object
SecurityContext defines container-level security attributes for the inference container.
allowPrivilegeEscalation boolean
AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls whether the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is always true when the container is: 1) run as Privileged, or 2) has CAP_SYS_ADMIN. Note that this field cannot be set when spec.os.name is windows.
appArmorProfile object
appArmorProfile is the AppArmor options to use by this container. If set, this profile overrides the pod's appArmorProfile. Note that this field cannot be set when spec.os.name is windows.
localhostProfile string
localhostProfile indicates a profile loaded on the node that should be used. The profile must be preconfigured on the node to work. Must match the loaded name of the profile. Must be set if and only if type is "Localhost".
type string required
type indicates which kind of AppArmor profile will be applied. Valid options are: Localhost - a profile pre-loaded on the node. RuntimeDefault - the container runtime's default profile. Unconfined - no AppArmor enforcement.
capabilities object
The capabilities to add/drop when running containers. Defaults to the default set of capabilities granted by the container runtime. Note that this field cannot be set when spec.os.name is windows.
add []string
Added capabilities
drop []string
Removed capabilities
privileged boolean
Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false. Note that this field cannot be set when spec.os.name is windows.
procMount string
procMount denotes the type of proc mount to use for the containers. The default value is Default which uses the container runtime defaults for readonly paths and masked paths. Note that this field cannot be set when spec.os.name is windows.
readOnlyRootFilesystem boolean
Whether this container has a read-only root filesystem. Default is false. Note that this field cannot be set when spec.os.name is windows.
runAsGroup integer
The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
format: int64
runAsNonRoot boolean
Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
runAsUser integer
The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
format: int64
seLinuxOptions object
The SELinux context to be applied to the container. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.
level string
Level is SELinux level label that applies to the container.
role string
Role is a SELinux role label that applies to the container.
type string
Type is a SELinux type label that applies to the container.
user string
User is a SELinux user label that applies to the container.
seccompProfile object
The seccomp options to use by this container. If seccomp options are provided at both the pod & container level, the container options override the pod options. Note that this field cannot be set when spec.os.name is windows.
localhostProfile string
localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must be set if type is "Localhost". Must NOT be set for any other type.
type string required
type indicates which kind of seccomp profile will be applied. Valid options are: Localhost - a profile defined in a file on the node should be used. RuntimeDefault - the container runtime default profile should be used. Unconfined - no profile should be applied.
windowsOptions object
The Windows-specific settings applied to all containers. If unspecified, the options from the PodSecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.
gmsaCredentialSpec string
GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.
gmsaCredentialSpecName string
GMSACredentialSpecName is the name of the GMSA credential spec to use.
hostProcess boolean
HostProcess determines if a container should be run as a 'Host Process' container. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.
runAsUserName string
The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
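As one possible hardening baseline built from the securityContext fields above, a Linux inference container could be restricted as sketched below. The UID is an assumption; whether a given runtime image actually runs as non-root is not stated in this reference, so treat this as a starting point to test rather than a known-good profile.

```yaml
spec:
  # other spec fields omitted for brevity
  securityContext:
    allowPrivilegeEscalation: false   # sets no_new_privs on the process
    runAsNonRoot: true                # kubelet refuses to start a UID 0 image
    runAsUser: 1000                   # assumed non-root UID present in the image
    capabilities:
      drop:
        - ALL                         # drop every default capability
    seccompProfile:
      type: RuntimeDefault            # container runtime's default profile
```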
skipModelInit boolean
SkipModelInit disables the model-downloader init container. Use when the model is baked into the image or downloaded by the container itself (e.g., via HF_TOKEN).
tensorOverrides []string
TensorOverrides provides fine-grained tensor placement overrides for power users. Each entry specifies a tensor name and target device (e.g., "exps=CPU", "token_embd=CUDA0"). Maps to llama.cpp --override-tensor flag (one flag per entry).
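The two example entries from the description above would be written as a list like this. Per the description, each entry becomes its own --override-tensor flag.

```yaml
spec:
  # other spec fields omitted for brevity
  tensorOverrides:
    - "exps=CPU"           # keep MoE expert tensors on the CPU
    - "token_embd=CUDA0"   # place the token embedding on the first CUDA device
```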
tgiConfig object
TGIConfig holds configuration for the TGI runtime. Only used when Runtime is "tgi".
dtype string
Dtype sets the model data type (float16, bfloat16).
enum: float16, bfloat16
hfTokenSecretRef object
HFTokenSecretRef references a Secret containing the HuggingFace token.
key string required
The key of the secret to select from. Must be a valid secret key.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional boolean
Specify whether the Secret or its key must be defined
maxInputLength integer
MaxInputLength sets the maximum input token length.
format: int32
maxTotalTokens integer
MaxTotalTokens sets the maximum total tokens (input + output).
format: int32
quantize string
Quantize sets the quantization method (bitsandbytes, gptq, awq, eetq).
enum: bitsandbytes, gptq, awq, eetq
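A sketch of a TGI configuration using the fields above. The Secret name `hf-token`, its key `token`, and the token limits are hypothetical placeholders.

```yaml
spec:
  # other spec fields omitted for brevity
  runtime: tgi               # tgiConfig is only used with this runtime
  tgiConfig:
    dtype: bfloat16          # one of: float16, bfloat16
    maxInputLength: 4096     # illustrative input token limit
    maxTotalTokens: 8192     # illustrative input + output token limit
    hfTokenSecretRef:
      name: hf-token         # hypothetical Secret in the same namespace
      key: token             # hypothetical key holding the HuggingFace token
```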
tolerations []object
Tolerations for pod scheduling (e.g., GPU taints, spot instances)
effect string
Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
key string
Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty, operator must be Exists; this combination means to match all values and all keys.
operator string
Operator represents a key's relationship to the value. Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category. Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
tolerationSeconds integer
TolerationSeconds represents the period of time the toleration (which must be of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.
format: int64
value string
Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty, otherwise just a regular string.
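The toleration fields above might be used as follows for the two cases the description mentions, GPU taints and spot instances. The `nvidia.com/gpu` taint key is a common convention rather than something this schema defines, and `example.com/spot` is a hypothetical key.

```yaml
spec:
  # other spec fields omitted for brevity
  tolerations:
    - key: nvidia.com/gpu      # commonly used GPU node taint (assumed here)
      operator: Exists         # matches any value for this key
      effect: NoSchedule
    - key: example.com/spot    # hypothetical spot-instance taint
      operator: Equal
      value: "true"
      effect: NoExecute
      tolerationSeconds: 300   # stay up to 5 minutes after the taint appears
```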
uBatchSize integer
UBatchSize sets the micro-batch size for decoding. Smaller micro-batches reduce memory usage during generation. Maps to llama.cpp --ubatch-size flag.
format: int32
minimum: 1
vllmConfig object
VLLMConfig holds configuration for the vLLM runtime. Only used when Runtime is "vllm".
attentionBackend string
AttentionBackend selects the attention implementation used by vLLM. FLASHINFER is typically fastest on recent NVIDIA GPUs (especially Blackwell); FLASH_ATTN is a solid default; XFORMERS and torch_sdpa are portability fallbacks. Requires a vLLM version that supports the chosen backend. Both uppercase (vLLM's native form) and lowercase spellings are accepted for backwards compatibility with earlier LLMKube releases. Maps to vLLM --attention-backend flag.
enum: FLASH_ATTN, FLASHINFER, XFORMERS, flashinfer, flash_attn, xformers, torch_sdpa
cpuOffloadGB integer
CPUOffloadGB sets how much CPU RAM (in GB) vLLM may use to hold offloaded model weights, which effectively extends the available GPU memory. When set, passes --cpu-offload-gb to vLLM. The value is per rank, so 4 on TP=2 means 4 GB of CPU RAM per GPU. Use when FP8 model weights don't fit in VRAM. Throughput hit is 2-5x on the offloaded path.
format: int32
minimum: 0
dtype string
Dtype sets the model data type (auto, float16, bfloat16).
enum: auto, float16, bfloat16
enableChunkedPrefill boolean
EnableChunkedPrefill interleaves long prefills with decode steps so a large paste (e.g. a 32K-token file) does not starve concurrent decode streams. Only emitted when explicitly set to true. Maps to vLLM --enable-chunked-prefill flag.
enableExpertParallel boolean
EnableExpertParallel distributes MoE experts across tensor-parallel ranks instead of replicating them. Only meaningful for MoE models. Maps to vLLM --enable-expert-parallel flag.
enablePrefixCaching boolean
EnablePrefixCaching turns on vLLM's automatic prefix caching for repeated prompts. Significantly reduces time-to-first-token for conversational and agentic workloads where requests share a common system prompt. Only emitted when explicitly set to true; when nil or false, the flag is not emitted and vLLM's own default applies. Maps to vLLM --enable-prefix-caching flag.
gpuMemoryUtilization number
GPUMemoryUtilization sets the fraction of GPU memory vLLM may use. When set, passes --gpu-memory-utilization to vLLM. Valid range is 0.1 to 0.99. Unset by default, in which case vLLM applies its own default (0.90).
minimum: 0.1
maximum: 0.99
hfTokenSecretRef object
HFTokenSecretRef references a Secret containing the HuggingFace token.
key string required
The key of the secret to select from. Must be a valid secret key.
name string
Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional boolean
Specify whether the Secret or its key must be defined
kvCacheCustomDtype string
KVCacheCustomDtype sets a custom vLLM KV cache element type that is not in the standard enum. Used for vLLM versions with additional cache formats such as TurboQuant 2-bit (turbo2, shipped in v0.20.0). Maps to vLLM --kv-cache-dtype. The runtime image must understand the value or vLLM will fail to start; LLMKube does not validate the string. Mirrors the llama.cpp-side CacheTypeCustomK/V escape hatch. Takes precedence over KVCacheDtype when both are set.
kvCacheDtype string
KVCacheDtype selects the KV cache element type. fp8_e5m2 and fp8_e4m3 cut KV cache memory roughly in half versus auto (which follows dtype), which is what unlocks 128K+ context on consumer VRAM for agentic workloads. Maps to vLLM --kv-cache-dtype flag. For custom build types not in the enum (e.g. TurboQuant turbo2 from vLLM v0.20+), use KVCacheCustomDtype instead.
enum: auto, fp8_e5m2, fp8_e4m3
maxModelLen integer
MaxModelLen sets the maximum model context length.
format: int32
maxNumBatchedTokens integer
MaxNumBatchedTokens sets the maximum number of tokens batched together per step. This is the main throughput knob: too low leaves the server prefill-bound, too high risks OOM on long context. No default; only emitted when set. Maps to vLLM --max-num-batched-tokens flag.
format: int32
minimum: 512
quantization string
Quantization method. awq, gptq, squeezellm are classic 4-bit formats. fp8 targets 8-bit FP checkpoints (Qwen FP8, Llama FP8, etc.). nvfp4 is NVIDIA's Blackwell-native 4-bit format. compressed-tensors is the neuralmagic/vLLM cross-format loader used by Unsloth and other recent releases.
enum: awq, gptq, squeezellm, fp8, nvfp4, compressed-tensors
speculative object
Speculative enables draft-model speculative decoding. On single-stream agentic workloads this can be 30-60% faster than plain tensor-parallel execution. Requires a second (smaller) Model CR to act as the draft.
enabled boolean
Enabled toggles speculative decoding on. When false or nil, no speculative flags are emitted regardless of other fields.
model string
Model references the Model CR (in the same namespace as the InferenceService) to use as the speculative draft model. Required when Enabled is true. If missing, speculative decoding is skipped and the InferenceService surfaces a SpeculativeInvalid status condition rather than failing the reconcile. Maps to vLLM --speculative-model flag.
numSpeculativeTokens integer
NumSpeculativeTokens is the number of draft tokens proposed per step. Typical sweet spot is 3-5; higher values increase wasted work when the draft disagrees with the target model. Maps to vLLM --num-speculative-tokens flag.
format: int32
minimum: 1
maximum: 16
tensorParallelSize integer
TensorParallelSize sets the number of GPUs for tensor parallelism.
format: int32
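To show how the vllmConfig fields fit together, here is a sketch of a long-context, two-GPU configuration with speculative decoding. Every value is illustrative, `draft-model` is a hypothetical Model CR name, and whether a particular combination (for example FP8 weights with an FP8 KV cache on a given attention backend) works depends on the vLLM version in the runtime image.

```yaml
spec:
  # other spec fields omitted for brevity
  runtime: vllm                  # vllmConfig is only used with this runtime
  vllmConfig:
    tensorParallelSize: 2        # shard across 2 GPUs
    dtype: auto
    quantization: fp8            # 8-bit FP checkpoint
    kvCacheDtype: fp8_e4m3       # roughly halves KV cache memory vs auto
    maxModelLen: 131072          # illustrative 128K context
    maxNumBatchedTokens: 8192    # illustrative; schema minimum is 512
    gpuMemoryUtilization: 0.9    # must be within 0.1 to 0.99
    enablePrefixCaching: true    # emits --enable-prefix-caching
    enableChunkedPrefill: true   # emits --enable-chunked-prefill
    attentionBackend: FLASHINFER
    speculative:
      enabled: true
      model: draft-model         # hypothetical Model CR in the same namespace
      numSpeculativeTokens: 4    # within the 3-5 range the description suggests
```

If speculative.model were left out while enabled is true, the description above indicates the service would skip speculative decoding and surface a SpeculativeInvalid condition instead of failing.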
status object
status defines the observed state of InferenceService
conditions []object
conditions represent the current state of the InferenceService resource. Each condition has a unique type and reflects the status of a specific aspect of the resource. Standard condition types include:
- "Available": the resource is fully functional
- "Progressing": the resource is being created or updated
- "Degraded": the resource failed to reach or maintain its desired state
The status of each condition is one of True, False, or Unknown.
lastTransitionTime string required
lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format: date-time
message string required
message is a human readable message indicating details about the transition. This may be an empty string.
maxLength: 32768
observedGeneration integer
observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.
format: int64
minimum: 0
reason string required
reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.
pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength: 1
maxLength: 1024
status string required
status of the condition, one of True, False, Unknown.
enum: True, False, Unknown
type string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength: 316
desiredReplicas integer
DesiredReplicas is the desired number of replicas
format: int32
effectivePriority integer
EffectivePriority shows the resolved priority value from the applied PriorityClass
format: int32
endpoint string
Endpoint is the service URL where inference requests can be sent
lastUpdated string
LastUpdated is the timestamp of the last status update
format: date-time
modelReady boolean
ModelReady indicates if the referenced Model is in Ready state
phase string
Phase represents the current lifecycle phase of the InferenceService. Possible values: Pending, Creating, Progressing, Ready, WaitingForGPU, Stopped, Failed. Stopped is the terminal state when spec.replicas=0 has caused the agent to tear down the workload; tooling polling for readiness should treat Stopped the same as Pending (the user intentionally took the service offline; this is not an error).
enum: Pending, Creating, Progressing, Ready, WaitingForGPU, Stopped, Failed
queuePosition integer
QueuePosition indicates position among pending InferenceServices cluster-wide (0 = not queued)
format: int32
readyReplicas integer
ReadyReplicas is the number of inference pods that are ready; compare it with desiredReplicas to see ready vs desired pods.
format: int32
replicas integer
Replicas is the current number of running inference pods
format: int32
schedulingMessage string
SchedulingMessage provides details about scheduling issues
schedulingStatus string
SchedulingStatus indicates why pods cannot be scheduled (e.g., "InsufficientGPU")
waitingFor string
WaitingFor describes the resource constraint (e.g., "nvidia.com/gpu: 1")
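For orientation, a status block for a service that is queued for a GPU might look roughly like the sketch below. All values, including the condition reason, message, and timestamp, are made up for illustration and are not output captured from a real cluster.

```yaml
status:
  phase: WaitingForGPU
  modelReady: true
  desiredReplicas: 1
  replicas: 0
  readyReplicas: 0
  queuePosition: 2                     # 0 would mean not queued
  schedulingStatus: InsufficientGPU
  schedulingMessage: "No node has a free GPU"   # illustrative text
  waitingFor: "nvidia.com/gpu: 1"
  conditions:
    - type: Available
      status: "False"                  # quoted so YAML keeps it a string
      reason: InsufficientGPU          # illustrative CamelCase reason
      message: "Waiting for GPU capacity"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```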
