InferenceService
inference.llmkube.dev / v1alpha1
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
apiVersion
string
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind
string
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata
object
spec object required
spec defines the desired state of InferenceService
args
[]string
Args overrides the container arguments entirely.
Only used when Runtime is "generic". For llamacpp, use ExtraArgs instead.
autoscaling object
Autoscaling configures horizontal pod autoscaling for the inference service.
When set, the controller creates and manages an HPA resource targeting the
inference Deployment. Requires Prometheus Adapter for custom metrics.
Mutually exclusive with manual replica management: when autoscaling is enabled,
the Replicas field serves as the initial replica count only.
maxReplicas
integer required
MaxReplicas is the upper limit for the number of replicas.
format:
int32
minimum:
1
maximum:
100
metrics []object
Metrics defines the scaling metrics and target values.
If empty, defaults to llamacpp:requests_processing with target average value of 2.
name
string required
Name is the metric name (e.g., llamacpp:requests_processing).
targetAverageUtilization
integer
TargetAverageUtilization is the target utilization percentage for Resource-type metrics.
format:
int32
targetAverageValue
string
TargetAverageValue is the target per-pod average for Pods-type metrics.
type
string required
Type is the metric source type.
enum:
Pods, Resource
minReplicas
integer
MinReplicas is the lower limit for the number of replicas.
format:
int32
minimum:
1
maximum:
10
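As an illustration of the autoscaling fields above, the following sketch spells out the documented default metric explicitly. The names `example` and `example-model` are placeholders, and the replica bounds are arbitrary values chosen within the documented limits.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model          # placeholder Model CR name
  autoscaling:
    minReplicas: 1                 # documented range: 1 to 10
    maxReplicas: 4                 # documented range: 1 to 100
    metrics:
      - type: Pods                 # per-pod average metric
        name: llamacpp:requests_processing
        targetAverageValue: "2"    # string-typed, matches the documented default
```

Omitting `metrics` should produce the same scaling behavior, since this entry mirrors the default described above.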
batchSize
integer
BatchSize sets the token batch size for prompt processing.
Larger values improve throughput but use more memory.
Maps to llama.cpp --batch-size flag.
format:
int32
minimum:
1
maximum:
16384
cacheTypeCustomK
string
CacheTypeCustomK sets a custom KV cache type for keys that is not in the
standard enum. Used for llama.cpp forks with additional cache formats such
as TurboQuant (turbo3, turbo4, tbqp3, etc.). Maps to llama.cpp
--cache-type-k. The runtime binary must understand the value or llama-server
will fail to start; LLMKube does not validate the string.
Takes precedence over CacheTypeK when both are set.
cacheTypeCustomV
string
CacheTypeCustomV sets a custom KV cache type for values that is not in the
standard enum. See CacheTypeCustomK for usage notes. Takes precedence over
CacheTypeV when both are set.
cacheTypeK
string
CacheTypeK sets the KV cache quantization type for keys.
Supported values depend on the llama.cpp build version.
Maps to llama.cpp --cache-type-k flag. Default: f16 (llama.cpp default).
For custom build types not in the enum (e.g. TurboQuant turbo3, tbqp3), use
CacheTypeCustomK instead.
enum:
f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, iq4_nl
cacheTypeV
string
CacheTypeV sets the KV cache quantization type for values.
Maps to llama.cpp --cache-type-v flag. Default: f16 (llama.cpp default).
For custom build types not in the enum (e.g. TurboQuant turbo3, tbqp3), use
CacheTypeCustomV instead.
enum:
f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, iq4_nl
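A minimal sketch of KV cache quantization using values from the standard enum. The Model CR name is a placeholder, and whether a given type works depends on the llama.cpp build in the image.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # placeholder Model CR name
  cacheTypeK: q8_0          # passed as --cache-type-k
  cacheTypeV: q8_0          # passed as --cache-type-v
  # For a fork-specific type outside the enum, set cacheTypeCustomK /
  # cacheTypeCustomV instead; those take precedence when both are set.
```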
command
[]string
Command overrides the container entrypoint.
Only used when Runtime is "generic" or for advanced customization.
containerPort
integer
ContainerPort overrides the primary container port.
Each runtime has its own default (llamacpp: 8080).
format:
int32
minimum:
1
maximum:
65535
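The following sketch shows how command, args, image, and containerPort might be combined for a generic runtime. The `runtime` field name is inferred from the descriptions above and is not shown in this section; the image, binary path, and port are hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model                          # placeholder Model CR name
  runtime: generic                                 # assumed field name and value
  image: registry.example.com/my-runtime:latest    # hypothetical image; required for generic
  command: ["/usr/local/bin/serve"]                # hypothetical entrypoint
  args: ["--port", "9000"]                         # replaces the container arguments entirely
  containerPort: 9000                              # overrides the runtime default port
```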
contextSize
integer
ContextSize sets the context window size for the llama.cpp server (-c flag).
Larger values allow processing longer inputs but require more memory.
If not specified, llama.cpp uses its default (typically 512 or 2048).
The upper bound covers Qwen 3.6 at 1M-via-YaRN with margin and accommodates
near-future hybrid-attention model architectures. KV cache memory is the
user's responsibility to size via spec.resources.memory or hostMemory.
format:
int32
minimum:
128
maximum:
2097152
endpoint object
Endpoint defines the service endpoint configuration
path
string
Path is the HTTP path for the inference endpoint
port
integer
Port is the service port
format:
int32
minimum:
1
maximum:
65535
type
string
Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer)
enum:
ClusterIP, NodePort, LoadBalancer
env []object
Env adds environment variables to the inference container.
Useful for HF_TOKEN, custom runtime config, etc.
name
string required
Name of the environment variable.
May consist of any printable ASCII characters except '='.
value
string
Variable references $(VAR_NAME) are expanded
using the previously defined environment variables in the container and
any service environment variables. If a variable cannot be resolved,
the reference in the input string will be unchanged. Double $$ are reduced
to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e.
"$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)".
Escaped references will never be expanded, regardless of whether the variable
exists or not.
Defaults to "".
valueFrom object
Source for the environment variable's value. Cannot be used if value is not empty.
configMapKeyRef object
Selects a key of a ConfigMap.
key
string required
The key to select.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional
boolean
Specify whether the ConfigMap or its key must be defined
fieldRef object
Selects a field of the pod: supports metadata.name, metadata.namespace, `metadata.labels['<KEY>']`, `metadata.annotations['<KEY>']`,
spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.
apiVersion
string
Version of the schema the FieldPath is written in terms of, defaults to "v1".
fieldPath
string required
Path of the field to select in the specified API version.
fileKeyRef object
FileKeyRef selects a key of the env file.
Requires the EnvFiles feature gate to be enabled.
key
string required
The key within the env file. An invalid key will prevent the pod from starting.
The keys defined within a source may consist of any printable ASCII characters except '='.
During Alpha stage of the EnvFiles feature gate, the key size is limited to 128 characters.
optional
boolean
Specify whether the file or its key must be defined. If the file or key
does not exist, then the env var is not published.
If optional is set to true and the specified key does not exist,
the environment variable will not be set in the Pod's containers.
If optional is set to false and the specified key does not exist,
an error will be returned during Pod creation.
path
string required
The path within the volume from which to select the file.
Must be relative and may not contain the '..' path or start with '..'.
volumeName
string required
The name of the volume mount containing the env file.
resourceFieldRef object
Selects a resource of the container: only resources limits and requests
(limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.
containerName
string
Container name: required for volumes, optional for env vars
divisor
string | integer
Specifies the output format of the exposed resources, defaults to "1"
string pattern:
^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
resource
string required
Required: resource to select
secretKeyRef object
Selects a key of a secret in the pod's namespace
key
string required
The key of the secret to select from. Must be a valid secret key.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional
boolean
Specify whether the Secret or its key must be defined
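A short sketch of the three most common env sources: a literal value, a Secret key, and a pod field. The Secret name `hf-credentials` and the `LOG_LEVEL` variable are hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model        # placeholder Model CR name
  env:
    - name: LOG_LEVEL            # hypothetical variable with a literal value
      value: "info"
    - name: HF_TOKEN             # read from a Secret in the pod's namespace
      valueFrom:
        secretKeyRef:
          name: hf-credentials   # hypothetical Secret name
          key: HF_TOKEN
    - name: POD_NAME             # read from the pod's own metadata
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
```

Each entry sets either `value` or `valueFrom`, never both.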
evictionProtection
boolean
EvictionProtection marks this service as ineligible for memory-pressure
eviction by the metal-agent watchdog. Use this for production workloads
that should never be silently stopped under memory pressure, even when
they are the lowest-priority option. The agent's per-process pickEvictionTarget
excludes protected processes from the eviction-candidate set; the
MemoryPressure status condition is still patched on protected services
for operator visibility.
Has no effect when --eviction-enabled is unset on the metal-agent or
for non-llama-server runtimes (oMLX, Ollama). Defaults to false.
extraArgs
[]string
ExtraArgs provides additional command-line arguments passed directly to the
runtime process. Use for flags not yet supported as typed CRD fields.
Arguments are appended after all other configured flags.
Supported by the "llamacpp" and "vllm" runtimes. Ignored by others.
Example: ["--seed", "42", "--log-disable"]
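Written out in a manifest, the example above might look like this. The Model CR name is a placeholder; each flag and each flag value is a separate list item.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # placeholder Model CR name
  extraArgs:                # appended after all other configured flags
    - "--seed"
    - "42"
    - "--log-disable"
```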
flashAttention
boolean
FlashAttention enables flash attention for faster prompt processing and
reduced KV cache memory. Maps to llama.cpp --flash-attn flag.
On NVIDIA GPUs requires Ampere or newer (compute capability 8.0+).
On Apple Silicon (Metal agent path) the default is true when this field
is unset, because the wired-collector + flash-attn combination prevents
the ~25% decode degradation observed at long context on Qwen-class
models running on M-series chips.
image
string
Image is the container image for the inference runtime.
For llamacpp runtime, defaults to ghcr.io/ggml-org/llama.cpp:server.
For generic runtime, this field is required.
imagePullSecrets []object
ImagePullSecrets for pulling container images from private registries.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
jinja
boolean
Jinja enables Jinja2 chat template rendering for tool/function calling support.
Required when using the OpenAI-compatible API with tools. Maps to llama.cpp --jinja flag.
metadataOverrides
[]string
MetadataOverrides overrides GGUF metadata key-value pairs at model load time.
Each entry is passed as a separate --override-kv flag. Format: key=type:value
(e.g., "qwen35moe.context_length=int:1048576" to extend context window, or
"tokenizer.chat_template.thinking=bool:false" to tweak tokenizer behavior).
Maps to llama.cpp --override-kv flag (one flag per entry).
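A sketch using the two override strings quoted above. The Model CR name is a placeholder, and the keys must exist in the GGUF metadata of the model actually being served.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # placeholder Model CR name
  metadataOverrides:        # each entry becomes its own --override-kv flag
    - "qwen35moe.context_length=int:1048576"
    - "tokenizer.chat_template.thinking=bool:false"
```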
modelRef
string required
ModelRef references the Model CR that contains the model to serve
moeCPULayers
integer
MoeCPULayers sets the number of MoE layers to offload to CPU.
When set, only the specified number of MoE layers run on CPU rather than all.
Maps to llama.cpp --n-cpu-moe flag.
format:
int32
minimum:
0
moeCPUOffload
boolean
MoeCPUOffload offloads all MoE expert layers to CPU for reduced VRAM usage.
Enables running large MoE models (e.g., Qwen3-30B, Mixtral) on VRAM-constrained
hardware by keeping attention layers on GPU while expert weights use system RAM.
Maps to llama.cpp --cpu-moe flag. Requires sufficient system RAM via resources.memory.
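A minimal sketch of MoE expert offloading. The shape of the `resources` block is assumed from the `resources.memory` reference above and is not documented in this section; the memory figure is an arbitrary placeholder, not a sizing recommendation.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # placeholder Model CR name
  moeCPUOffload: true       # passed as --cpu-moe; expert weights use system RAM
  resources:
    memory: "64Gi"          # assumed field shape; size to hold the expert weights
```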
noKvOffload
boolean
NoKvOffload keeps the KV cache in system RAM instead of VRAM.
Useful for extended context windows when VRAM is constrained by model weights.
Maps to llama.cpp --no-kv-offload flag. Requires sufficient system RAM via resources.memory.
noWarmup
boolean
NoWarmup skips the llama.cpp startup warmup inference pass.
Reduces pod ready time at the cost of slightly higher first-request latency.
Useful for scale-to-zero and quick redeployment patterns.
Maps to llama.cpp --no-warmup flag.
nodeSelector
object
NodeSelector for pod placement (e.g., specific node pools)
parallelSlots
integer
ParallelSlots sets the number of concurrent request slots for the llama.cpp
server (--parallel flag). Each slot processes one request independently;
higher values use more KV cache memory. If not specified, the operator
omits --parallel and llama.cpp picks an auto value (currently 4).
format:
int32
minimum:
1
maximum:
64
personaPlexConfig object
PersonaPlexConfig holds configuration for the PersonaPlex (Moshi) runtime.
Only used when Runtime is "personaplex".
cpuOffload
boolean
CPUOffload enables model weight offloading to system RAM when GPU VRAM is insufficient.
Requires the accelerate package in the container image.
hfTokenSecretRef object
HFTokenSecretRef references a Secret containing the HuggingFace token for model download.
The Secret key must be "HF_TOKEN".
key
string required
The key of the secret to select from. Must be a valid secret key.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional
boolean
Specify whether the Secret or its key must be defined
quantize4Bit
boolean
Quantize4Bit enables NF4 4-bit quantization for reduced VRAM usage (~9.6 GB vs ~14 GB).
Requires the bitsandbytes package in the container image.
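A sketch of a PersonaPlex configuration with 4-bit quantization and a token Secret. The `runtime` field name is inferred from the description above and is not shown in this section; the Secret name is hypothetical.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model      # placeholder Model CR name
  runtime: personaplex         # assumed field name
  personaPlexConfig:
    quantize4Bit: true         # needs bitsandbytes in the image
    cpuOffload: false          # set true only if the image ships accelerate
    hfTokenSecretRef:
      name: hf-credentials     # hypothetical Secret name
      key: HF_TOKEN            # the key must be "HF_TOKEN"
```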
podAnnotations
object
PodAnnotations are merged into the inference Pod's metadata.annotations.
Use this to tag Pods for downstream tooling (cost attribution, service
mesh routing, custom admission controllers) without those tools needing
to know about LLMKube's CRD schema. Pure passthrough; the operator
itself does not set any annotations on inference Pods today.
podLabels
object
PodLabels are merged into the inference Pod's metadata.labels alongside
the operator-managed labels (`app`, `inference.llmkube.dev/model`,
`inference.llmkube.dev/service`). Operator-managed keys take precedence
on collision so the Deployment selector stays in sync with the Pods it
owns. The Deployment selector itself uses only the operator-managed
labels and is immutable, so changing PodLabels later is safe.
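A sketch of pod metadata passthrough. The label and annotation keys are hypothetical and stand in for whatever downstream tooling expects.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model             # placeholder Model CR name
  podLabels:
    team: ml-platform                 # hypothetical label
    # A key named "app" here would lose to the operator-managed value.
  podAnnotations:
    example.com/cost-center: "1234"   # hypothetical annotation
```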
podSecurityContext object
PodSecurityContext defines pod-level security attributes for inference pods.
Use this to set fsGroup for volume permissions (required on OpenShift).
appArmorProfile object
appArmorProfile is the AppArmor options to use by the containers in this pod.
Note that this field cannot be set when spec.os.name is windows.
localhostProfile
string
localhostProfile indicates a profile loaded on the node that should be used.
The profile must be preconfigured on the node to work.
Must match the loaded name of the profile.
Must be set if and only if type is "Localhost".
type
string required
type indicates which kind of AppArmor profile will be applied.
Valid options are:
Localhost - a profile pre-loaded on the node.
RuntimeDefault - the container runtime's default profile.
Unconfined - no AppArmor enforcement.
fsGroup
integer
A special supplemental group that applies to all containers in a pod.
Some volume types allow the Kubelet to change the ownership of that volume
to be owned by the pod:
1. The owning GID will be the FSGroup
2. The setgid bit is set (new files created in the volume will be owned by FSGroup)
3. The permission bits are OR'd with rw-rw----
If unset, the Kubelet will not modify the ownership and permissions of any volume.
Note that this field cannot be set when spec.os.name is windows.
format:
int64
fsGroupChangePolicy
string
fsGroupChangePolicy defines behavior of changing ownership and permission of the volume
before being exposed inside Pod. This field will only apply to
volume types which support fsGroup based ownership (and permissions).
It will have no effect on ephemeral volume types such as: secret, configmaps
and emptydir.
Valid values are "OnRootMismatch" and "Always". If not specified, "Always" is used.
Note that this field cannot be set when spec.os.name is windows.
runAsGroup
integer
The GID to run the entrypoint of the container process.
Uses runtime default if unset.
May also be set in SecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence
for that container.
Note that this field cannot be set when spec.os.name is windows.
format:
int64
runAsNonRoot
boolean
Indicates that the container must run as a non-root user.
If true, the Kubelet will validate the image at runtime to ensure that it
does not run as UID 0 (root) and fail to start the container if it does.
If unset or false, no such validation will be performed.
May also be set in SecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
runAsUser
integer
The UID to run the entrypoint of the container process.
Defaults to user specified in image metadata if unspecified.
May also be set in SecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence
for that container.
Note that this field cannot be set when spec.os.name is windows.
format:
int64
seLinuxChangePolicy
string
seLinuxChangePolicy defines how the container's SELinux label is applied to all volumes used by the Pod.
It has no effect on nodes that do not support SELinux or on volumes that do not support SELinux.
Valid values are "MountOption" and "Recursive".
"Recursive" means relabeling of all files on all Pod volumes by the container runtime.
This may be slow for large volumes, but allows mixing privileged and unprivileged Pods sharing the same volume on the same node.
"MountOption" mounts all eligible Pod volumes with `-o context` mount option.
This requires all Pods that share the same volume to use the same SELinux label.
It is not possible to share the same volume among privileged and unprivileged Pods.
Eligible volumes are in-tree FibreChannel and iSCSI volumes, and all CSI volumes
whose CSI driver announces SELinux support by setting spec.seLinuxMount: true in their
CSIDriver instance. Other volumes are always re-labelled recursively.
"MountOption" value is allowed only when SELinuxMount feature gate is enabled.
If not specified and SELinuxMount feature gate is enabled, "MountOption" is used.
If not specified and SELinuxMount feature gate is disabled, "MountOption" is used for ReadWriteOncePod volumes
and "Recursive" for all other volumes.
This field affects only Pods that have SELinux label set, either in PodSecurityContext or in SecurityContext of all containers.
All Pods that use the same volume should use the same seLinuxChangePolicy, otherwise some pods can get stuck in ContainerCreating state.
Note that this field cannot be set when spec.os.name is windows.
seLinuxOptions object
The SELinux context to be applied to all containers.
If unspecified, the container runtime will allocate a random SELinux context for each
container. May also be set in SecurityContext. If set in
both SecurityContext and PodSecurityContext, the value specified in SecurityContext
takes precedence for that container.
Note that this field cannot be set when spec.os.name is windows.
level
string
Level is SELinux level label that applies to the container.
role
string
Role is a SELinux role label that applies to the container.
type
string
Type is a SELinux type label that applies to the container.
user
string
User is a SELinux user label that applies to the container.
seccompProfile object
The seccomp options to use by the containers in this pod.
Note that this field cannot be set when spec.os.name is windows.
localhostProfile
string
localhostProfile indicates a profile defined in a file on the node should be used.
The profile must be preconfigured on the node to work.
Must be a descending path, relative to the kubelet's configured seccomp profile location.
Must be set if type is "Localhost". Must NOT be set for any other type.
type
string required
type indicates which kind of seccomp profile will be applied.
Valid options are:
Localhost - a profile defined in a file on the node should be used.
RuntimeDefault - the container runtime default profile should be used.
Unconfined - no profile should be applied.
supplementalGroups
[]integer
A list of groups applied to the first process run in each container, in
addition to the container's primary GID and fsGroup (if specified). If
the SupplementalGroupsPolicy feature is enabled, the
supplementalGroupsPolicy field determines whether these are in addition
to or instead of any group memberships defined in the container image.
If unspecified, no additional groups are added, though group memberships
defined in the container image may still be used, depending on the
supplementalGroupsPolicy field.
Note that this field cannot be set when spec.os.name is windows.
supplementalGroupsPolicy
string
Defines how supplemental groups of the first container processes are calculated.
Valid values are "Merge" and "Strict". If not specified, "Merge" is used.
(Alpha) Using the field requires the SupplementalGroupsPolicy feature gate to be enabled
and the container runtime must implement support for this feature.
Note that this field cannot be set when spec.os.name is windows.
sysctls []object
Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported
sysctls (by the container runtime) might fail to launch.
Note that this field cannot be set when spec.os.name is windows.
name
string required
Name of a property to set
value
string required
Value of a property to set
windowsOptions object
The Windows specific settings applied to all containers.
If unspecified, the options within a container's SecurityContext will be used.
If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
Note that this field cannot be set when spec.os.name is linux.
gmsaCredentialSpec
string
GMSACredentialSpec is where the GMSA admission webhook
(https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the
GMSA credential spec named by the GMSACredentialSpecName field.
gmsaCredentialSpecName
string
GMSACredentialSpecName is the name of the GMSA credential spec to use.
hostProcess
boolean
HostProcess determines if a container should be run as a 'Host Process' container.
All of a Pod's containers must have the same effective HostProcess value
(it is not allowed to have a mix of HostProcess containers and non-HostProcess containers).
In addition, if HostProcess is true then HostNetwork must also be set to true.
runAsUserName
string
The UserName in Windows to run the entrypoint of the container process.
Defaults to the user specified in image metadata if unspecified.
May also be set in PodSecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
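A minimal sketch of a pod security context for volume permissions. The GID is a placeholder; use a value that the cluster's admission policy allows.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model                   # placeholder Model CR name
  podSecurityContext:
    fsGroup: 1000                           # placeholder GID that owns supported volumes
    fsGroupChangePolicy: OnRootMismatch     # documented alternative to the default "Always"
    seccompProfile:
      type: RuntimeDefault                  # container runtime's default profile
```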
priority
string
Priority determines scheduling priority for GPU allocation.
Higher priority services can preempt lower priority ones when GPUs are scarce.
enum:
critical, high, normal, low, batch
priorityClassName
string
PriorityClassName allows specifying a custom Kubernetes PriorityClass.
Takes precedence over the Priority field if set.
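A sketch of the two ways to set scheduling priority. The PriorityClass name in the comment is hypothetical and would have to exist in the cluster.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # placeholder Model CR name
  priority: high            # one of: critical, high, normal, low, batch
  # priorityClassName: gpu-critical   # hypothetical; if set, overrides priority
```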
probeOverrides object
ProbeOverrides allows replacing the auto-generated health probes.
Useful for runtimes with non-HTTP health endpoints (e.g., TCP, WebSocket).
liveness object
Liveness overrides the liveness probe.
exec object
Exec specifies a command to execute in the container.
command
[]string
Command is the command line to execute inside the container, the working directory for the
command is root ('/') in the container's filesystem. The command is simply exec'd, it is
not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use
a shell, you need to explicitly call out to that shell.
Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
failureThreshold
integer
Minimum consecutive failures for the probe to be considered failed after having succeeded.
Defaults to 3. Minimum value is 1.
format:
int32
grpc object
GRPC specifies a GRPC HealthCheckRequest.
port
integer required
Port number of the gRPC service. Number must be in the range 1 to 65535.
format:
int32
service
string
Service is the name of the service to place in the gRPC HealthCheckRequest
(see https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
If this is not specified, the default behavior is defined by gRPC.
httpGet object
HTTPGet specifies an HTTP GET request to perform.
host
string
Host name to connect to, defaults to the pod IP. You probably want to set
"Host" in httpHeaders instead.
httpHeaders []object
Custom headers to set in the request. HTTP allows repeated headers.
name
string required
The header field name.
This will be canonicalized upon output, so case-variant names will be understood as the same header.
value
string required
The header field value
path
string
Path to access on the HTTP server.
port
string | integer required
Name or number of the port to access on the container.
Number must be in the range 1 to 65535.
Name must be an IANA_SVC_NAME.
scheme
string
Scheme to use for connecting to the host.
Defaults to HTTP.
initialDelaySeconds
integer
Number of seconds after the container has started before liveness probes are initiated.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format:
int32
periodSeconds
integer
How often (in seconds) to perform the probe.
Defaults to 10 seconds. Minimum value is 1.
format:
int32
successThreshold
integer
Minimum consecutive successes for the probe to be considered successful after having failed.
Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
format:
int32
tcpSocket object
TCPSocket specifies a connection to a TCP port.
host
string
Optional: Host name to connect to, defaults to the pod IP.
port
string | integer required
Number or name of the port to access on the container.
Number must be in the range 1 to 65535.
Name must be an IANA_SVC_NAME.
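For a runtime without an HTTP health endpoint, a TCP liveness override might be sketched as follows. The port and timing values are placeholders (8080 is the documented llamacpp default port).

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example
spec:
  modelRef: example-model   # placeholder Model CR name
  probeOverrides:
    liveness:
      tcpSocket:
        port: 8080          # placeholder; number or named container port
      periodSeconds: 10     # same as the documented default
      failureThreshold: 3   # same as the documented default
```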
terminationGracePeriodSeconds
integer
Optional duration in seconds the pod needs to terminate gracefully upon probe failure.
The grace period is the duration in seconds after the processes running in the pod are sent
a termination signal and the time when the processes are forcibly halted with a kill signal.
Set this value longer than the expected cleanup time for your process.
If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this
value overrides the value provided by the pod spec.
Value must be non-negative integer. The value zero indicates stop immediately via
the kill signal (no opportunity to shut down).
This is a beta field and requires enabling ProbeTerminationGracePeriod feature gate.
Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
format:
int64
timeoutSeconds
integer
Number of seconds after which the probe times out.
Defaults to 1 second. Minimum value is 1.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format:
int32
readiness object
Readiness overrides the readiness probe.
exec object
Exec specifies a command to execute in the container.
command
[]string
Command is the command line to execute inside the container, the working directory for the
command is root ('/') in the container's filesystem. The command is simply exec'd, it is
not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use
a shell, you need to explicitly call out to that shell.
Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
failureThreshold
integer
Minimum consecutive failures for the probe to be considered failed after having succeeded.
Defaults to 3. Minimum value is 1.
format:
int32
grpc object
GRPC specifies a GRPC HealthCheckRequest.
port
integer required
Port number of the gRPC service. Number must be in the range 1 to 65535.
format:
int32
service
string
Service is the name of the service to place in the gRPC HealthCheckRequest
(see https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
If this is not specified, the default behavior is defined by gRPC.
httpGet object
HTTPGet specifies an HTTP GET request to perform.
host
string
Host name to connect to, defaults to the pod IP. You probably want to set
"Host" in httpHeaders instead.
httpHeaders []object
Custom headers to set in the request. HTTP allows repeated headers.
name
string required
The header field name.
This will be canonicalized upon output, so case-variant names will be understood as the same header.
value
string required
The header field value
path
string
Path to access on the HTTP server.
port
string | integer required
Name or number of the port to access on the container.
Number must be in the range 1 to 65535.
Name must be an IANA_SVC_NAME.
scheme
string
Scheme to use for connecting to the host.
Defaults to HTTP.
initialDelaySeconds
integer
Number of seconds after the container has started before liveness probes are initiated.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format:
int32
periodSeconds
integer
How often (in seconds) to perform the probe.
Defaults to 10 seconds. Minimum value is 1.
format:
int32
successThreshold
integer
Minimum consecutive successes for the probe to be considered successful after having failed.
Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
format:
int32
tcpSocket object
TCPSocket specifies a connection to a TCP port.
host
string
Optional: Host name to connect to, defaults to the pod IP.
port
string | integer required
Number or name of the port to access on the container.
Number must be in the range 1 to 65535.
Name must be an IANA_SVC_NAME.
terminationGracePeriodSeconds
integer
Optional duration in seconds the pod needs to terminate gracefully upon probe failure.
The grace period is the duration in seconds after the processes running in the pod are sent
a termination signal and the time when the processes are forcibly halted with a kill signal.
Set this value longer than the expected cleanup time for your process.
If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this
value overrides the value provided by the pod spec.
Value must be non-negative integer. The value zero indicates stop immediately via
the kill signal (no opportunity to shut down).
This is a beta field and requires enabling ProbeTerminationGracePeriod feature gate.
Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
format:
int64
timeoutSeconds
integer
Number of seconds after which the probe times out.
Defaults to 1 second. Minimum value is 1.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format:
int32
startup object
Startup overrides the startup probe.
exec object
Exec specifies a command to execute in the container.
command
[]string
Command is the command line to execute inside the container, the working directory for the
command is root ('/') in the container's filesystem. The command is simply exec'd, it is
not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use
a shell, you need to explicitly call out to that shell.
Exit status of 0 is treated as live/healthy and non-zero is unhealthy.
failureThreshold
integer
Minimum consecutive failures for the probe to be considered failed after having succeeded.
Defaults to 3. Minimum value is 1.
format:
int32
grpc object
GRPC specifies a GRPC HealthCheckRequest.
port
integer required
Port number of the gRPC service. Number must be in the range 1 to 65535.
format:
int32
service
string
Service is the name of the service to place in the gRPC HealthCheckRequest
(see https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
If this is not specified, the default behavior is defined by gRPC.
httpGet object
HTTPGet specifies an HTTP GET request to perform.
host
string
Host name to connect to, defaults to the pod IP. You probably want to set
"Host" in httpHeaders instead.
httpHeaders []object
Custom headers to set in the request. HTTP allows repeated headers.
name
string required
The header field name.
This will be canonicalized upon output, so case-variant names will be understood as the same header.
value
string required
The header field value
path
string
Path to access on the HTTP server.
port
string | integer required
Name or number of the port to access on the container.
Number must be in the range 1 to 65535.
Name must be an IANA_SVC_NAME.
scheme
string
Scheme to use for connecting to the host.
Defaults to HTTP.
initialDelaySeconds
integer
Number of seconds after the container has started before liveness probes are initiated.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format:
int32
periodSeconds
integer
How often (in seconds) to perform the probe.
Defaults to 10 seconds. Minimum value is 1.
format:
int32
successThreshold
integer
Minimum consecutive successes for the probe to be considered successful after having failed.
Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
format:
int32
tcpSocket object
TCPSocket specifies a connection to a TCP port.
host
string
Optional: Host name to connect to, defaults to the pod IP.
port
string | integer required
Number or name of the port to access on the container.
Number must be in the range 1 to 65535.
Name must be an IANA_SVC_NAME.
terminationGracePeriodSeconds
integer
Optional duration in seconds the pod needs to terminate gracefully upon probe failure.
The grace period is the duration in seconds after the processes running in the pod are sent
a termination signal and the time when the processes are forcibly halted with a kill signal.
Set this value longer than the expected cleanup time for your process.
If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this
value overrides the value provided by the pod spec.
Value must be non-negative integer. The value zero indicates stop immediately via
the kill signal (no opportunity to shut down).
This is a beta field and requires enabling ProbeTerminationGracePeriod feature gate.
Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
format:
int64
timeoutSeconds
integer
Number of seconds after which the probe times out.
Defaults to 1 second. Minimum value is 1.
More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
format:
int32
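As a rough illustration, a startup probe override for a slow-loading model might look like the fragment below. It shows only the `startup` object; the parent key it nests under is documented earlier in this reference. The `/health` path and port 8080 are assumptions made for the example, so substitute whatever your container actually serves.

```yaml
# Fragment of a startup probe override (all values are illustrative).
startup:
  httpGet:
    path: /health        # assumed health endpoint
    port: 8080           # assumed container port
  periodSeconds: 10      # probe every 10 seconds
  failureThreshold: 60   # tolerate up to 60 failures, i.e. about 600s of model loading
  timeoutSeconds: 5      # each probe attempt times out after 5 seconds
```

A high `failureThreshold` combined with a modest `periodSeconds` gives large models time to load without loosening the liveness probe.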
reasoningBudget
integer
ReasoningBudget caps the number of reasoning tokens the model is allowed to
emit per response. Zero disables visible thinking output entirely; the model
still reasons internally but does not emit thinking tokens. Critical for
production agentic workloads on thinking models (Qwen 3.6, GLM-5) where
runaway reasoning can burn compute.
Maps to llama.cpp --reasoning-budget flag.
format:
int32
minimum:
0
reasoningBudgetMessage
string
ReasoningBudgetMessage is injected when the reasoning budget is exhausted,
forcing the model to conclude. Ignored unless ReasoningBudget is also set.
Maps to llama.cpp --reasoning-budget-message flag.
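A minimal sketch of how the two reasoning fields might be combined in a spec. The budget of 2048 tokens and the message text are illustrative values, not recommendations, and other spec fields are omitted.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  # Cap reasoning output at 2048 tokens per response (illustrative value).
  reasoningBudget: 2048
  # Injected once the budget is exhausted; ignored if reasoningBudget is unset.
  reasoningBudgetMessage: "Reasoning budget reached. Provide the final answer now."
```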
replicas
integer
Replicas is the desired number of inference pods
format:
int32
minimum:
0
maximum:
10
resources object
Resources defines compute resources for inference pods
cpu
string
CPU requests (e.g., "2" or "2000m")
gpu
integer
GPU is the number of GPUs required per pod.
For multi-GPU inference, each pod gets this many GPUs.
Note: multi-GPU sharding configuration comes from the Model CRD.
format:
int32
minimum:
0
maximum:
8
gpuMemory
string
GPUMemory specifies GPU memory limit per pod (e.g., "16Gi")
Used for scheduling and validation
hostMemory
string
HostMemory specifies the system RAM required for hybrid GPU/CPU offloading (e.g., "64Gi").
Used when MoE expert weights or KV cache are offloaded to CPU via moeCPUOffload or noKvOffload.
Translated to pod resources.requests.memory, taking precedence over Memory when set.
Without this, the K8s scheduler has no visibility into the pod's actual RAM consumption,
which can lead to OOM kills after model load.
memory
string
Memory requests (e.g., "4Gi")
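The resource fields above could be combined as in the following sketch for a hybrid GPU/CPU offloading setup. All quantities are illustrative assumptions; size them for your own model and nodes.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  replicas: 1
  resources:
    gpu: 2              # each pod gets 2 GPUs
    gpuMemory: "24Gi"   # used for scheduling and validation
    cpu: "8"
    memory: "16Gi"
    # When set, hostMemory takes precedence over memory for the pod's
    # resources.requests.memory, per the field description above.
    hostMemory: "64Gi"
```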
ropeScaling object
RopeScaling configures RoPE-based context extension so a model can be
served past its native trained context (e.g. 128K served at 256K via
YaRN). For the llamacpp runtime this maps to --rope-scaling /
--rope-scale / --yarn-orig-ctx. Prefer this over raw spec.extraArgs:
it is validated and discoverable via `kubectl explain`. If --rope-scaling
is also present in spec.extraArgs, extraArgs wins and this is skipped.
factor
string
Factor is the scale multiplier (--rope-scale), e.g. "2.0" to double the
native context. A string to avoid CRD float pitfalls; the runtime parses
it as a float. Optional.
pattern:
^[0-9]+(\.[0-9]+)?$
originalContext
integer
OriginalContext is the model's native training context length
(--yarn-orig-ctx), e.g. 131072 for a 128K model. Recommended with yarn.
format:
int32
minimum:
128
type
string required
Type is the scaling method (--rope-scaling). "yarn" is the usual choice
for extending context (e.g. 128K to 256K).
enum:
linear, yarn, longrope
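A short sketch of the 128K to 256K case described above, using YaRN. The values follow the examples in the field descriptions; whether a given model tolerates a 2x extension is something to verify for your model.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  ropeScaling:
    type: yarn                # maps to --rope-scaling
    factor: "2.0"             # a string, maps to --rope-scale
    originalContext: 131072   # native 128K context, maps to --yarn-orig-ctx
```

Note that `factor` is quoted: the schema types it as a string matching `^[0-9]+(\.[0-9]+)?$`, so an unquoted YAML float would not validate.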
runtime
string
Runtime selects the inference server backend.
"llamacpp" (default): llama.cpp server with auto-generated args and /health probes.
"generic": user-provided container with custom command, args, env, and probes.
"personaplex": NVIDIA PersonaPlex (Moshi) speech-to-speech server.
"vllm": vLLM OpenAI-compatible server with PagedAttention.
"tgi": HuggingFace Text Generation Inference server.
enum:
llamacpp, personaplex, vllm, tgi, generic
runtimeClassName
string
RuntimeClassName selects a Kubernetes RuntimeClass for the inference Pod.
Most commonly set to "nvidia" on clusters where the NVIDIA Container
Runtime is not configured as the cluster default. Without it, GPU pods
schedule onto the GPU node but never get the device files bind-mounted,
and the container fails at runtime with "no CUDA-capable device is
detected". Maps directly to PodSpec.RuntimeClassName.
Most clusters running the NVIDIA GPU Operator with the default toolkit
env do not need this set; it is an escape hatch for clusters where the
runtime configuration is non-default.
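For a cluster where the NVIDIA runtime is not the default, the setting might look like the sketch below. It assumes a RuntimeClass named `nvidia` already exists in the cluster; that name is the common convention mentioned above, not something this resource creates.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  runtime: llamacpp          # the default backend, shown explicitly
  runtimeClassName: nvidia   # assumes a RuntimeClass with this name exists
```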
securityContext object
SecurityContext defines container-level security attributes for the inference container.
allowPrivilegeEscalation
boolean
AllowPrivilegeEscalation controls whether a process can gain more
privileges than its parent process. This bool directly controls if
the no_new_privs flag will be set on the container process.
AllowPrivilegeEscalation is true always when the container is:
1) run as Privileged
2) has CAP_SYS_ADMIN
Note that this field cannot be set when spec.os.name is windows.
appArmorProfile object
appArmorProfile is the AppArmor options to use by this container. If set, this profile
overrides the pod's appArmorProfile.
Note that this field cannot be set when spec.os.name is windows.
localhostProfile
string
localhostProfile indicates a profile loaded on the node that should be used.
The profile must be preconfigured on the node to work.
Must match the loaded name of the profile.
Must be set if and only if type is "Localhost".
type
string required
type indicates which kind of AppArmor profile will be applied.
Valid options are:
Localhost - a profile pre-loaded on the node.
RuntimeDefault - the container runtime's default profile.
Unconfined - no AppArmor enforcement.
capabilities object
The capabilities to add/drop when running containers.
Defaults to the default set of capabilities granted by the container runtime.
Note that this field cannot be set when spec.os.name is windows.
add
[]string
Added capabilities
drop
[]string
Removed capabilities
privileged
boolean
Run container in privileged mode.
Processes in privileged containers are essentially equivalent to root on the host.
Defaults to false.
Note that this field cannot be set when spec.os.name is windows.
procMount
string
procMount denotes the type of proc mount to use for the containers.
The default value is Default which uses the container runtime defaults for
readonly paths and masked paths.
Note that this field cannot be set when spec.os.name is windows.
readOnlyRootFilesystem
boolean
Whether this container has a read-only root filesystem.
Default is false.
Note that this field cannot be set when spec.os.name is windows.
runAsGroup
integer
The GID to run the entrypoint of the container process.
Uses runtime default if unset.
May also be set in PodSecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
Note that this field cannot be set when spec.os.name is windows.
format:
int64
runAsNonRoot
boolean
Indicates that the container must run as a non-root user.
If true, the Kubelet will validate the image at runtime to ensure that it
does not run as UID 0 (root) and fail to start the container if it does.
If unset or false, no such validation will be performed.
May also be set in PodSecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
runAsUser
integer
The UID to run the entrypoint of the container process.
Defaults to user specified in image metadata if unspecified.
May also be set in PodSecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
Note that this field cannot be set when spec.os.name is windows.
format:
int64
seLinuxOptions object
The SELinux context to be applied to the container.
If unspecified, the container runtime will allocate a random SELinux context for each
container. May also be set in PodSecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
Note that this field cannot be set when spec.os.name is windows.
level
string
Level is SELinux level label that applies to the container.
role
string
Role is a SELinux role label that applies to the container.
type
string
Type is a SELinux type label that applies to the container.
user
string
User is a SELinux user label that applies to the container.
seccompProfile object
The seccomp options to use by this container. If seccomp options are
provided at both the pod & container level, the container options
override the pod options.
Note that this field cannot be set when spec.os.name is windows.
localhostProfile
string
localhostProfile indicates a profile defined in a file on the node should be used.
The profile must be preconfigured on the node to work.
Must be a descending path, relative to the kubelet's configured seccomp profile location.
Must be set if type is "Localhost". Must NOT be set for any other type.
type
string required
type indicates which kind of seccomp profile will be applied.
Valid options are:
Localhost - a profile defined in a file on the node should be used.
RuntimeDefault - the container runtime default profile should be used.
Unconfined - no profile should be applied.
windowsOptions object
The Windows specific settings applied to all containers.
If unspecified, the options from the PodSecurityContext will be used.
If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.
Note that this field cannot be set when spec.os.name is linux.
gmsaCredentialSpec
string
GMSACredentialSpec is where the GMSA admission webhook
(https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the
GMSA credential spec named by the GMSACredentialSpecName field.
gmsaCredentialSpecName
string
GMSACredentialSpecName is the name of the GMSA credential spec to use.
hostProcess
boolean
HostProcess determines if a container should be run as a 'Host Process' container.
All of a Pod's containers must have the same effective HostProcess value
(it is not allowed to have a mix of HostProcess containers and non-HostProcess containers).
In addition, if HostProcess is true then HostNetwork must also be set to true.
runAsUserName
string
The UserName in Windows to run the entrypoint of the container process.
Defaults to the user specified in image metadata if unspecified.
May also be set in PodSecurityContext. If set in both SecurityContext and
PodSecurityContext, the value specified in SecurityContext takes precedence.
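As one possible hardened configuration for a Linux inference container, the securityContext fields above could be set as follows. The UID/GID of 1000 is a hypothetical value, and `readOnlyRootFilesystem` only works if the image does not write to its root filesystem, which is an assumption to check for your runtime image.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  securityContext:
    allowPrivilegeEscalation: false
    runAsNonRoot: true
    runAsUser: 1000                # hypothetical non-root UID
    runAsGroup: 1000               # hypothetical GID
    readOnlyRootFilesystem: true   # only if the image never writes to /
    capabilities:
      drop:
        - ALL
    seccompProfile:
      type: RuntimeDefault
```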
skipModelInit
boolean
SkipModelInit disables the model-downloader init container.
Use when the model is baked into the image or downloaded by the
container itself (e.g., via HF_TOKEN).
tensorOverrides
[]string
TensorOverrides provides fine-grained tensor placement overrides for power users.
Each entry specifies a tensor name and target device (e.g., "exps=CPU", "token_embd=CUDA0").
Maps to llama.cpp --override-tensor flag (one flag per entry).
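Using the two entries given as examples above, a tensorOverrides list might be written like this. Each list item becomes one `--override-tensor` flag.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  tensorOverrides:
    - "exps=CPU"           # keep MoE expert tensors on the CPU
    - "token_embd=CUDA0"   # place the token embedding on the first CUDA device
```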
tgiConfig object
TGIConfig holds configuration for the TGI runtime.
Only used when Runtime is "tgi".
dtype
string
Dtype sets the model data type (float16, bfloat16).
enum:
float16, bfloat16
hfTokenSecretRef object
HFTokenSecretRef references a Secret containing the HuggingFace token.
key
string required
The key of the secret to select from. Must be a valid secret key.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional
boolean
Specify whether the Secret or its key must be defined
maxInputLength
integer
MaxInputLength sets the maximum input token length.
format:
int32
maxTotalTokens
integer
MaxTotalTokens sets the maximum total tokens (input + output).
format:
int32
quantize
string
Quantize sets the quantization method (bitsandbytes, gptq, awq, eetq).
enum:
bitsandbytes, gptq, awq, eetq
tolerations []object
Tolerations for pod scheduling (e.g., GPU taints, spot instances)
effect
string
Effect indicates the taint effect to match. Empty means match all taint effects.
When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
key
string
Key is the taint key that the toleration applies to. Empty means match all taint keys.
If the key is empty, operator must be Exists; this combination means to match all values and all keys.
operator
string
Operator represents a key's relationship to the value.
Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal.
Exists is equivalent to wildcard for value, so that a pod can
tolerate all taints of a particular category.
Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).
tolerationSeconds
integer
TolerationSeconds represents the period of time the toleration (which must be
of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
it is not set, which means tolerate the taint forever (do not evict). Zero and
negative values will be treated as 0 (evict immediately) by the system.
format:
int64
value
string
Value is the taint value the toleration matches to.
If the operator is Exists, the value should be empty, otherwise just a regular string.
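The sketch below combines a TGI configuration (the tgiConfig fields documented just before tolerations) with a toleration for a tainted GPU node. The Secret name `hf-token`, its key `token`, the token limits, and the `nvidia.com/gpu` taint key are all assumptions chosen for illustration.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  runtime: tgi
  tgiConfig:
    dtype: bfloat16
    maxInputLength: 4096    # illustrative limit
    maxTotalTokens: 8192    # input + output, illustrative limit
    hfTokenSecretRef:
      name: hf-token        # hypothetical Secret name
      key: token            # hypothetical key within that Secret
  tolerations:
    - key: nvidia.com/gpu   # assumed taint key on the GPU nodes
      operator: Exists      # value is left empty with Exists
      effect: NoSchedule
```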
uBatchSize
integer
UBatchSize sets the micro-batch size for decoding.
Smaller micro-batches reduce memory usage during generation.
Maps to llama.cpp --ubatch-size flag.
format:
int32
minimum:
1
vllmConfig object
VLLMConfig holds configuration for the vLLM runtime.
Only used when Runtime is "vllm".
attentionBackend
string
AttentionBackend selects the attention implementation used by vLLM.
FLASHINFER is typically fastest on recent NVIDIA GPUs (especially Blackwell);
FLASH_ATTN is a solid default; XFORMERS and torch_sdpa are portability
fallbacks. Requires a vLLM version that supports the chosen backend.
Both uppercase (vLLM's native form) and lowercase spellings are accepted
for backwards compatibility with earlier LLMKube releases.
Maps to vLLM --attention-backend flag.
enum:
FLASH_ATTN, FLASHINFER, XFORMERS, flashinfer, flash_attn, xformers, torch_sdpa
cpuOffloadGB
integer
CPUOffloadGB sets how much CPU RAM, in GB, vLLM may use to offload model
weights, effectively extending GPU memory. When set, passes
--cpu-offload-gb to vLLM. The value is per rank, so 4 with TP=2 means 4 GB
of CPU RAM per GPU. Use when FP8 model weights don't fit in VRAM.
Expect a 2-5x throughput hit on the offloaded path.
format:
int32
minimum:
0
dtype
string
Dtype sets the model data type (auto, float16, bfloat16).
enum:
auto, float16, bfloat16
enableChunkedPrefill
boolean
EnableChunkedPrefill interleaves long prefills with decode steps so a
large paste (e.g. a 32K-token file) does not starve concurrent decode
streams. Only emitted when explicitly set to true.
Maps to vLLM --enable-chunked-prefill flag.
enableExpertParallel
boolean
EnableExpertParallel distributes MoE experts across tensor-parallel ranks
instead of replicating them. Only meaningful for MoE models.
Maps to vLLM --enable-expert-parallel flag.
enablePrefixCaching
boolean
EnablePrefixCaching turns on vLLM's automatic prefix caching for repeated prompts.
Significantly reduces time-to-first-token for conversational and agentic workloads
where requests share a common system prompt.
Only emitted when explicitly set to true; when nil or false, the flag is
omitted and vLLM's own default applies.
Maps to vLLM --enable-prefix-caching flag.
gpuMemoryUtilization
number
GPUMemoryUtilization sets the fraction of GPU memory that vLLM may use.
When set, passes --gpu-memory-utilization to vLLM. Valid range is 0.1 to 0.99.
Unset by default, in which case vLLM uses its own default of 0.90.
minimum:
0.1
maximum:
0.99
hfTokenSecretRef object
HFTokenSecretRef references a Secret containing the HuggingFace token.
key
string required
The key of the secret to select from. Must be a valid secret key.
name
string
Name of the referent.
This field is effectively required, but due to backwards compatibility is
allowed to be empty. Instances of this type with an empty value here are
almost certainly wrong.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
optional
boolean
Specify whether the Secret or its key must be defined
kvCacheCustomDtype
string
KVCacheCustomDtype sets a custom vLLM KV cache element type that is not
in the standard enum. Used for vLLM versions with additional cache
formats such as TurboQuant 2-bit (turbo2, shipped in v0.20.0). Maps to
vLLM --kv-cache-dtype. The runtime image must understand the value or
vLLM will fail to start; LLMKube does not validate the string. Mirrors
the llama.cpp-side CacheTypeCustomK/V escape hatch.
Takes precedence over KVCacheDtype when both are set.
kvCacheDtype
string
KVCacheDtype selects the KV cache element type. fp8_e5m2 and fp8_e4m3 cut
KV cache memory roughly in half versus auto (which follows dtype), which
is what unlocks 128K+ context on consumer VRAM for agentic workloads.
Maps to vLLM --kv-cache-dtype flag.
For custom build types not in the enum (e.g. TurboQuant turbo2 from
vLLM v0.20+), use KVCacheCustomDtype instead.
enum:
auto, fp8_e5m2, fp8_e4m3
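A sketch of a vLLM configuration that uses several of the fields documented so far, aimed at a long-context agentic workload. The specific values are illustrative assumptions; the chosen attention backend and FP8 KV cache both depend on what your vLLM image and GPU support.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  runtime: vllm
  vllmConfig:
    dtype: auto
    attentionBackend: FLASH_ATTN   # described above as a solid default
    gpuMemoryUtilization: 0.9      # must be within 0.1 to 0.99
    enablePrefixCaching: true      # flag is emitted only when true
    enableChunkedPrefill: true     # flag is emitted only when true
    kvCacheDtype: fp8_e5m2         # roughly halves KV cache memory versus auto
```

If a custom cache format is needed instead, `kvCacheCustomDtype` would replace `kvCacheDtype` here, since it takes precedence when both are set.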
maxModelLen
integer
MaxModelLen sets the maximum model context length.
format:
int32
maxNumBatchedTokens
integer
MaxNumBatchedTokens sets the maximum number of tokens batched together
per step. This is the main throughput knob: too low means prefill-bound,
too high risks OOM on long context. There is no default; the flag is only emitted when set.
Maps to vLLM --max-num-batched-tokens flag.
format:
int32
minimum:
512
quantization
string
Quantization method.
awq, gptq, squeezellm are classic 4-bit formats. fp8 targets 8-bit FP
checkpoints (Qwen FP8, Llama FP8, etc.). nvfp4 is NVIDIA's Blackwell-native
4-bit format. compressed-tensors is the neuralmagic/vLLM cross-format
loader used by Unsloth and other recent releases.
enum:
awq, gptq, squeezellm, fp8, nvfp4, compressed-tensors
speculative object
Speculative enables draft-model speculative decoding. On single-stream
agentic workloads this can be 30-60% faster than plain tensor-parallel
execution. Requires a second (smaller) Model CR to act as the draft.
enabled
boolean
Enabled toggles speculative decoding on. When false or nil, no
speculative flags are emitted regardless of other fields.
model
string
Model references the Model CR (in the same namespace as the
InferenceService) to use as the speculative draft model.
Required when Enabled is true. If missing, speculative decoding is
skipped and the InferenceService surfaces a SpeculativeInvalid
status condition rather than failing the reconcile.
Maps to vLLM --speculative-model flag.
numSpeculativeTokens
integer
NumSpeculativeTokens is the number of draft tokens proposed per step.
Typical sweet spot is 3-5; higher values increase wasted work when the
draft disagrees with the target model.
Maps to vLLM --num-speculative-tokens flag.
format:
int32
minimum:
1
maximum:
16
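A minimal sketch of enabling speculative decoding, assuming `speculative` nests under `vllmConfig` as the field order in this reference suggests. The draft Model CR name `draft-model` is hypothetical and must refer to a Model in the same namespace as the InferenceService.

```yaml
# Fragment of an InferenceService spec (other fields omitted).
spec:
  runtime: vllm
  vllmConfig:
    speculative:
      enabled: true
      model: draft-model        # hypothetical name of a smaller Model CR
      numSpeculativeTokens: 4   # within the 3-5 range described above
```

If `model` is left out while `enabled` is true, the description above indicates that speculative decoding is skipped and a SpeculativeInvalid condition is surfaced, so checking status conditions after applying is worthwhile.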
tensorParallelSize
integer
TensorParallelSize sets the number of GPUs for tensor parallelism.
format:
int32
status object
status defines the observed state of InferenceService
conditions []object
conditions represent the current state of the InferenceService resource.
Each condition has a unique type and reflects the status of a specific aspect of the resource.
Standard condition types include:
- "Available": the resource is fully functional
- "Progressing": the resource is being created or updated
- "Degraded": the resource failed to reach or maintain its desired state
The status of each condition is one of True, False, or Unknown.
lastTransitionTime
string required
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format:
date-time
message
string required
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength:
32768
observedGeneration
integer
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format:
int64
minimum:
0
reason
string required
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
pattern:
^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength:
1
maxLength:
1024
status
string required
status of the condition, one of True, False, Unknown.
enum:
True, False, Unknown
type
string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern:
^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength:
316
desiredReplicas
integer
DesiredReplicas is the desired number of replicas
format:
int32
effectivePriority
integer
EffectivePriority shows the resolved priority value from the applied PriorityClass
format:
int32
endpoint
string
Endpoint is the service URL where inference requests can be sent
lastUpdated
string
LastUpdated is the timestamp of the last status update
format:
date-time
modelReady
boolean
ModelReady indicates if the referenced Model is in Ready state
phase
string
Phase represents the current lifecycle phase of the InferenceService.
Possible values: Pending, Creating, Progressing, Ready, WaitingForGPU,
Stopped, Failed. Stopped is the terminal state when spec.replicas=0
has caused the agent to tear down the workload; tooling polling for
readiness should treat Stopped the same as Pending (the user
intentionally took the service offline; this is not an error).
enum:
Pending, Creating, Progressing, Ready, WaitingForGPU, Stopped, Failed
queuePosition
integer
QueuePosition indicates position among pending InferenceServices cluster-wide (0 = not queued)
format:
int32
readyReplicas
integer
ReadyReplicas is the number of inference pods that are currently ready, for comparison against DesiredReplicas.
format:
int32
replicas
integer
Replicas is the current number of running inference pods
format:
int32
schedulingMessage
string
SchedulingMessage provides details about scheduling issues
schedulingStatus
string
SchedulingStatus indicates why pods cannot be scheduled (e.g., "InsufficientGPU")
waitingFor
string
WaitingFor describes the resource constraint (e.g., "nvidia.com/gpu: 1")
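To show how the status fields fit together, here is a hand-written example of what a service waiting on GPU capacity might report. It is not captured from a real cluster: the messages, the condition reason, the timestamp, and the counts are invented for illustration.

```yaml
# Illustrative status for an InferenceService that cannot yet be scheduled.
status:
  phase: WaitingForGPU
  modelReady: true
  desiredReplicas: 1
  replicas: 0
  readyReplicas: 0
  queuePosition: 2                      # 0 would mean not queued
  schedulingStatus: InsufficientGPU
  schedulingMessage: "No node currently has a free GPU"   # invented text
  waitingFor: "nvidia.com/gpu: 1"
  conditions:
    - type: Available
      status: "False"
      reason: InsufficientGPU           # invented reason value
      message: "Waiting for a GPU to become available"
      lastTransitionTime: "2025-01-01T00:00:00Z"
      observedGeneration: 3
```

Tooling that polls for readiness would typically wait for `phase: Ready`, and, per the phase description above, treat `Stopped` like `Pending` rather than as an error.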