InferenceService

inference.llmkube.dev / v1alpha1

apiVersion: inference.llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: example

apiVersion string

APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources

kind string

Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

metadata object

spec object required

spec defines the desired state of InferenceService

affinity object

Affinity constrains inference Pod placement (node/pod affinity and anti-affinity). Passthrough to the Pod spec, for finer control than NodeSelector — e.g. preferring or avoiding nodes already running other GPU workloads.

nodeAffinity object

Describes node affinity scheduling rules for the pod.

preferredDuringSchedulingIgnoredDuringExecution []object

The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node matches the corresponding matchExpressions; the node(s) with the highest sum are the most preferred.

preference object required

A node selector term, associated with the corresponding weight.

matchExpressions []object

A list of node selector requirements by node's labels.

key string required

The label key that the selector applies to.

operator string required

Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.

values []string

An array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. If the operator is Gt or Lt, the values array must have a single element, which will be interpreted as an integer. This array is replaced during a strategic merge patch.

matchFields []object

A list of node selector requirements by node's fields.

key string required

The label key that the selector applies to.

operator string required

Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.

values []string

weight integer required

Weight associated with matching the corresponding nodeSelectorTerm, in the range 1-100.

format: int32

requiredDuringSchedulingIgnoredDuringExecution object

nodeSelectorTerms []object required

Required. A list of node selector terms. The terms are ORed.

matchExpressions []object

A list of node selector requirements by node's labels.

key string required

The label key that the selector applies to.

operator string required

Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.

values []string

matchFields []object

A list of node selector requirements by node's fields.

key string required

The label key that the selector applies to.

operator string required

Represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.

values []string
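The nodeAffinity fields above can be combined as in the following sketch. It is a fragment, not a complete manifest: other spec fields are omitted, and the node label keys and values (example.com/gpu-class, example.com/gpu-memory-gib) are hypothetical placeholders rather than labels LLMKube defines.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the node must match at least one term (terms are ORed).
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: example.com/gpu-class        # hypothetical node label
                operator: In
                values: ["a100", "h100"]
      # Soft preference: matching nodes get 50 added to their score.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: example.com/gpu-memory-gib   # hypothetical node label
                operator: Gt
                values: ["40"]                    # single element, read as an integer
```

Because Gt and Lt interpret the single value as an integer, it is still written as a quoted string in YAML.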

podAffinity object

Describes pod affinity scheduling rules (e.g. co-locate this pod in the same node, zone, etc. as some other pod(s)).

preferredDuringSchedulingIgnoredDuringExecution []object

The scheduler will prefer to schedule pods to nodes that satisfy the affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling affinity expressions, etc.), compute a sum by iterating through the elements of this field and adding "weight" to the sum if the node has pods which match the corresponding podAffinityTerm; the node(s) with the highest sum are the most preferred.

podAffinityTerm object required

Required. A pod affinity term, associated with the corresponding weight.

labelSelector object

A label query over a set of resources, in this case pods. If it's null, this PodAffinityTerm matches with no Pods.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. This array is replaced during a strategic merge patch.

matchLabels object

matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed.

matchLabelKeys []string

MatchLabelKeys is a set of pod label keys used to select which pods are taken into consideration. The keys are used to look up values from the incoming pod's labels; those key-value pairs are merged with `labelSelector` as `key in (value)` to select the group of existing pods considered for the incoming pod's pod (anti) affinity. Keys that don't exist in the incoming pod's labels are ignored. The default value is empty. The same key must not appear in both matchLabelKeys and labelSelector. Also, matchLabelKeys cannot be set when labelSelector isn't set.

mismatchLabelKeys []string

MismatchLabelKeys is a set of pod label keys used to select which pods are taken into consideration. The keys are used to look up values from the incoming pod's labels; those key-value pairs are merged with `labelSelector` as `key notin (value)` to select the group of existing pods considered for the incoming pod's pod (anti) affinity. Keys that don't exist in the incoming pod's labels are ignored. The default value is empty. The same key must not appear in both mismatchLabelKeys and labelSelector. Also, mismatchLabelKeys cannot be set when labelSelector isn't set.

namespaceSelector object

A label query over the set of namespaces that the term applies to. The term is applied to the union of the namespaces selected by this field and the ones listed in the namespaces field. null selector and null or empty namespaces list means "this pod's namespace". An empty selector ({}) matches all namespaces.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

namespaces []string

namespaces specifies a static list of namespace names that the term applies to. The term is applied to the union of the namespaces listed in this field and the ones selected by namespaceSelector. null or empty namespaces list and null namespaceSelector means "this pod's namespace".

topologyKey string required

This pod should be co-located (affinity) or not co-located (anti-affinity) with the pods matching the labelSelector in the specified namespaces, where co-located is defined as running on a node whose value of the label with key topologyKey matches that of any node on which any of the selected pods is running. Empty topologyKey is not allowed.

weight integer required

weight associated with matching the corresponding podAffinityTerm, in the range 1-100.

format: int32

requiredDuringSchedulingIgnoredDuringExecution []object

If the affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to a pod label update), the system may or may not try to eventually evict the pod from its node. When there are multiple elements, the lists of nodes corresponding to each podAffinityTerm are intersected, i.e. all terms must be satisfied.

labelSelector object

A label query over a set of resources, in this case pods. If it's null, this PodAffinityTerm matches with no Pods.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

matchLabelKeys []string

mismatchLabelKeys []string

namespaceSelector object

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

namespaces []string

topologyKey string required

podAntiAffinity object

Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod in the same node, zone, etc. as some other pod(s)).

preferredDuringSchedulingIgnoredDuringExecution []object

The scheduler will prefer to schedule pods to nodes that satisfy the anti-affinity expressions specified by this field, but it may choose a node that violates one or more of the expressions. The node that is most preferred is the one with the greatest sum of weights, i.e. for each node that meets all of the scheduling requirements (resource request, requiredDuringScheduling anti-affinity expressions, etc.), compute a sum by iterating through the elements of this field and subtracting "weight" from the sum if the node has pods which match the corresponding podAffinityTerm; the node(s) with the highest sum are the most preferred.

podAffinityTerm object required

Required. A pod affinity term, associated with the corresponding weight.

labelSelector object

A label query over a set of resources, in this case pods. If it's null, this PodAffinityTerm matches with no Pods.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

matchLabelKeys []string

mismatchLabelKeys []string

namespaceSelector object

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

namespaces []string

topologyKey string required

weight integer required

weight associated with matching the corresponding podAffinityTerm, in the range 1-100.

format: int32

requiredDuringSchedulingIgnoredDuringExecution []object

If the anti-affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the anti-affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to a pod label update), the system may or may not try to eventually evict the pod from its node. When there are multiple elements, the lists of nodes corresponding to each podAffinityTerm are intersected, i.e. all terms must be satisfied.

labelSelector object

A label query over a set of resources, in this case pods. If it's null, this PodAffinityTerm matches with no Pods.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

matchLabelKeys []string

mismatchLabelKeys []string

namespaceSelector object

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

namespaces []string

topologyKey string required
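As an illustration of the pod anti-affinity fields, the sketch below asks the scheduler to prefer nodes that are not already running other GPU workloads, the use case mentioned in the Affinity description. The pod label example.com/workload is a hypothetical placeholder; kubernetes.io/hostname is the standard per-node topology label.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100                                 # range 1-100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname       # "co-located" means same node
            labelSelector:
              matchLabels:
                example.com/workload: gpu-inference   # hypothetical pod label
```

Using the preferred form keeps the pod schedulable when every node already hosts a matching pod; the required form would leave it Pending in that case.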

args []string

Args overrides the container arguments entirely. Only used when Runtime is "generic". For llamacpp, use ExtraArgs instead.

autoscaling object

Autoscaling configures horizontal pod autoscaling for the inference service. When set, the controller creates and manages an HPA resource targeting the inference Deployment. Requires Prometheus Adapter for custom metrics. Mutually exclusive with manual replica management: when autoscaling is enabled, the Replicas field serves as the initial replica count only.

maxReplicas integer required

MaxReplicas is the upper limit for the number of replicas.

format: int32

minimum: 1

maximum: 100

metrics []object

Metrics defines the scaling metrics and target values. If empty, defaults to llamacpp:requests_processing with target average value of 2.

name string required

Name is the metric name (e.g., llamacpp:requests_processing).

targetAverageUtilization integer

TargetAverageUtilization is the target utilization percentage for Resource-type metrics.

format: int32

targetAverageValue string

TargetAverageValue is the target per-pod average for Pods-type metrics.

type string required

Type is the metric source type.

enum: Pods, Resource

minReplicas integer

MinReplicas is the lower limit for the number of replicas.

format: int32

minimum: 1

maximum: 10
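A minimal autoscaling sketch, using only the fields listed above. The replica bounds are arbitrary example values, and the metric entry spells out what the description gives as the default when metrics is empty.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  autoscaling:
    minReplicas: 1          # allowed range 1-10
    maxReplicas: 4          # allowed range 1-100
    metrics:
      - type: Pods
        name: "llamacpp:requests_processing"
        targetAverageValue: "2"   # string, per the field type above
```

For a Resource-type metric, targetAverageUtilization (an integer percentage) would be set instead of targetAverageValue.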

batchSize integer

BatchSize sets the token batch size for prompt processing. Larger values improve throughput but use more memory. Maps to llama.cpp --batch-size flag.

format: int32

minimum: 1

maximum: 16384

cacheTypeCustomK string

CacheTypeCustomK sets a custom KV cache type for keys that is not in the standard enum. Used for llama.cpp forks with additional cache formats such as TurboQuant (turbo3, turbo4, tbqp3, etc.). Maps to llama.cpp --cache-type-k. The runtime binary must understand the value or llama-server will fail to start; LLMKube does not validate the string. Takes precedence over CacheTypeK when both are set.

cacheTypeCustomV string

CacheTypeCustomV sets a custom KV cache type for values that is not in the standard enum. See CacheTypeCustomK for usage notes. Takes precedence over CacheTypeV when both are set.

cacheTypeK string

CacheTypeK sets the KV cache quantization type for keys. Supported values depend on the llama.cpp build version. Maps to llama.cpp --cache-type-k flag. Default: f16 (llama.cpp default). For custom build types not in the enum (e.g. TurboQuant turbo3, tbqp3), use CacheTypeCustomK instead.

enum: f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, iq4_nl

cacheTypeV string

CacheTypeV sets the KV cache quantization type for values. Maps to llama.cpp --cache-type-v flag. Default: f16 (llama.cpp default). For custom build types not in the enum (e.g. TurboQuant turbo3, tbqp3), use CacheTypeCustomV instead.

enum: f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1, iq4_nl
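The interaction between the enum fields and the custom fields might look like this in practice. The choice of q8_0 is only an example; the commented-out custom values are taken from the TurboQuant names mentioned above and would only work with a runtime binary that understands them.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  cacheTypeK: q8_0   # maps to --cache-type-k
  cacheTypeV: q8_0   # maps to --cache-type-v
  # With a llama.cpp fork that adds cache formats, the custom fields take
  # precedence over the enum fields when both are set:
  # cacheTypeCustomK: turbo3
  # cacheTypeCustomV: turbo3
```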

command []string

Command overrides the container entrypoint. Only used when Runtime is "generic" or for advanced customization.
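A rough sketch of overriding the entrypoint for the "generic" runtime follows. It assumes the Runtime setting is exposed as spec.runtime, which this section does not show; the binary path and flags are hypothetical.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  runtime: generic                        # assumed field name for "Runtime"
  command: ["/usr/local/bin/my-server"]   # hypothetical entrypoint
  args: ["--port", "8080"]                # hypothetical flags; replaces the container args entirely
```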

containerPort integer

ContainerPort overrides the primary container port. Each runtime has its own default (llamacpp: 8080).

format: int32

minimum: 1

maximum: 65535

contextSize integer

ContextSize sets the context window size for the llama.cpp server (-c flag). Larger values allow processing longer inputs but require more memory. If not specified, llama.cpp uses its default (typically 512 or 2048). The upper bound covers Qwen 3.6 at 1M-via-YaRN with margin and accommodates near-future hybrid-attention model architectures. KV cache memory is the user's responsibility to size via spec.resources.memory or hostMemory.

format: int32

minimum: 128

maximum: 2097152
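For example, a service tuned for longer prompts could set the context window together with batchSize. Both numbers are illustrative values inside the documented ranges, not recommendations.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  contextSize: 32768   # llama.cpp -c; allowed range 128-2097152
  batchSize: 2048      # llama.cpp --batch-size; allowed range 1-16384
```

Larger values of either field increase memory use, so the pod's memory resources need to be sized to match.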

disruption object

Disruption controls how the operator manages node-disruption annotations on inference pods during the vulnerable startup window (model download + load). When ProtectStartup is true (the default), the operator sets karpenter.sh/do-not-disrupt: "true" on the pod template while the InferenceService is not yet Ready, then removes it once the service reaches the Ready phase. Set ProtectAlways to true to keep the annotation permanently (equivalent to setting it via podAnnotations). User-provided podAnnotations always win on collision.

protectAlways boolean

ProtectAlways keeps the disruption-protection annotation on the pod template permanently, regardless of the InferenceService phase. This is equivalent to setting karpenter.sh/do-not-disrupt: "true" via podAnnotations, but managed by the operator. Defaults to false.

protectStartup boolean

ProtectStartup prevents node disruption (e.g., Karpenter consolidation, Cluster Autoscaler scale-down) while the InferenceService is starting up. When true, the operator sets karpenter.sh/do-not-disrupt: "true" on the pod template until the InferenceService reaches the Ready phase, then removes it. Defaults to true.
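The two disruption switches can be written out explicitly as below. The values shown are the documented defaults, so this fragment behaves the same as omitting the block.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  disruption:
    protectStartup: true   # annotate the pod template until the service is Ready
    protectAlways: false   # set to true to keep the annotation permanently
```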

endpoint object

Endpoint defines the service endpoint configuration

gateway object

Gateway opts this InferenceService into Envoy AI Gateway exposure. When set and Enabled, the operator generates the Backend / AIServiceBackend / AIGatewayRoute resources that front this service through a pre-installed Envoy AI Gateway. nil (the default) preserves today's behavior (no gateway resources). The Envoy AI Gateway stack and the referenced Gateway are a documented prerequisite; LLMKube does not install or own them.

enabled boolean

Enabled is the opt-in switch. When false (or when Gateway is nil), the operator generates no gateway resources for this InferenceService.

gatewayRef object required

GatewayRef identifies the pre-installed Gateway (gateway.networking.k8s.io) the generated AIGatewayRoute attaches to. The Gateway typically lives in a dedicated gateway namespace; cross-namespace attachment requires the Gateway listener's allowedRoutes.namespaces to permit this InferenceService's namespace (a documented prerequisite for the MVP; the operator does not generate ReferenceGrants or touch the listener).

name string required

Name is the Gateway's name.

namespace string

Namespace is the Gateway's namespace. Empty means the InferenceService's own namespace.

modelName string

ModelName is the OpenAI "model" string clients send, matched by the generated route rule (the x-ai-eg-model header the gateway's ext_proc populates from the request body). Defaults to ModelRef, falling back to the InferenceService name when ModelRef is empty.
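A sketch of opting a service into gateway exposure is shown below. It assumes gateway nests under spec.endpoint, as the field ordering in this reference suggests; the Gateway name, namespace, and model name are hypothetical.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  endpoint:
    gateway:
      enabled: true
      gatewayRef:
        name: ai-gateway            # hypothetical pre-installed Gateway
        namespace: gateway-system   # hypothetical; empty means this service's namespace
      modelName: example-model      # hypothetical "model" string sent by clients
```

When the Gateway lives in another namespace, its listener's allowedRoutes.namespaces must already permit this service's namespace, since the operator does not configure that.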

nodePort integer

NodePort is the specific NodePort to pin when endpoint.type is NodePort. If set, the Service will use this exact port instead of auto-assigning from the 30000-32767 range. This provides a stable external endpoint across redeployments.

format: int32

minimum: 30000

maximum: 32767

path string

Path is the HTTP path for the inference endpoint

port integer

Port is the service port

format: int32

minimum: 1

maximum: 65535

type string

Type is the Kubernetes service type (ClusterIP, NodePort, LoadBalancer)

enum: ClusterIP, NodePort, LoadBalancer
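Pinning a stable NodePort, as described above, could look like the following. The port numbers are example values within the documented ranges.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  endpoint:
    type: NodePort
    port: 8080        # service port, 1-65535
    nodePort: 30080   # pinned port, must be within 30000-32767
```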

env []object

Env adds environment variables to the inference container. Useful for HF_TOKEN, custom runtime config, etc.

name string required

Name of the environment variable. May consist of any printable ASCII characters except '='.

value string

Variable references $(VAR_NAME) are expanded using the previously defined environment variables in the container and any service environment variables. If a variable cannot be resolved, the reference in the input string will be unchanged. Double $$ are reduced to a single $, which allows for escaping the $(VAR_NAME) syntax: i.e. "$$(VAR_NAME)" will produce the string literal "$(VAR_NAME)". Escaped references will never be expanded, regardless of whether the variable exists or not. Defaults to "".

valueFrom object

Source for the environment variable's value. Cannot be used if value is not empty.

configMapKeyRef object

Selects a key of a ConfigMap.

key string required

The key to select.

name string

Name of the referent. This field is effectively required, but due to backwards compatibility is allowed to be empty. Instances of this type with an empty value here are almost certainly wrong. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

optional boolean

Specify whether the ConfigMap or its key must be defined

fieldRef object

Selects a field of the pod: supports metadata.name, metadata.namespace, `metadata.labels['<KEY>']`, `metadata.annotations['<KEY>']`, spec.nodeName, spec.serviceAccountName, status.hostIP, status.podIP, status.podIPs.

apiVersion string

Version of the schema the FieldPath is written in terms of, defaults to "v1".

fieldPath string required

Path of the field to select in the specified API version.

fileKeyRef object

FileKeyRef selects a key of the env file. Requires the EnvFiles feature gate to be enabled.

key string required

The key within the env file. An invalid key will prevent the pod from starting. The keys defined within a source may consist of any printable ASCII characters except '='. During Alpha stage of the EnvFiles feature gate, the key size is limited to 128 characters.

optional boolean

Specify whether the file or its key must be defined. If the file or key does not exist, then the env var is not published. If optional is set to true and the specified key does not exist, the environment variable will not be set in the Pod's containers. If optional is set to false and the specified key does not exist, an error will be returned during Pod creation.

path string required

The path within the volume from which to select the file. Must be relative and may not contain the '..' path or start with '..'.

volumeName string required

The name of the volume mount containing the env file.

resourceFieldRef object

Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, limits.ephemeral-storage, requests.cpu, requests.memory and requests.ephemeral-storage) are currently supported.

containerName string

Container name: required for volumes, optional for env vars

divisor string | integer

Specifies the output format of the exposed resources, defaults to "1"

string pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$

resource string required

Required: resource to select

secretKeyRef object

Selects a key of a secret in the pod's namespace

key string required

The key of the secret to select from. Must be a valid secret key.

name string

optional boolean

Specify whether the Secret or its key must be defined
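The env sources above can be mixed in one list, as in this sketch. The Secret name and key are hypothetical; HF_TOKEN is the use case named in the Env description, and LOG_PREFIX is an invented variable that only demonstrates $(VAR_NAME) expansion.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-credentials   # hypothetical Secret in the pod's namespace
          key: token             # hypothetical key
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: LOG_PREFIX           # hypothetical variable
      value: "$(POD_NAME)-inference"   # expands using POD_NAME defined above
```

Expansion only sees variables defined earlier in the list, so POD_NAME must come before LOG_PREFIX.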

evictionProtection boolean

EvictionProtection marks this service as ineligible for memory-pressure eviction by the metal-agent watchdog. Use this for production workloads that should never be silently stopped under memory pressure, even when they are the lowest-priority option. The agent's per-process pickEvictionTarget excludes protected processes from the eviction-candidate set; the MemoryPressure status condition is still patched on protected services for operator visibility. Has no effect when --eviction-enabled is unset on the metal-agent or for non-llama-server runtimes (oMLX, Ollama). Defaults to false.

extraArgs []string

ExtraArgs provides additional command-line arguments passed directly to the runtime process. Use for flags not yet supported as typed CRD fields. Arguments are appended after all other configured flags. Supported by the "llamacpp" and "vllm" runtimes. Ignored by others. Example: ["--seed", "42", "--log-disable"]
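Written as a manifest fragment, the example from the description becomes the list below. Each flag and each value is its own list element, and numeric values are quoted so they stay strings.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  extraArgs:
    - "--seed"
    - "42"            # quoted so YAML parses it as a string, not an integer
    - "--log-disable"
```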

extraVolumeMounts []object

ExtraVolumeMounts mounts ExtraVolumes into the inference container, appended after the model-storage mounts. Names must match an entry in ExtraVolumes (or a volume from another passthrough field).

mountPath string required

Path within the container at which the volume should be mounted. Must not contain ':'.

mountPropagation string

mountPropagation determines how mounts are propagated from the host to container and the other way around. When not set, MountPropagationNone is used. This field is beta in 1.10. When RecursiveReadOnly is set to IfPossible or to Enabled, MountPropagation must be None or unspecified (which defaults to None).

name string required

This must match the Name of a Volume.

readOnly boolean

Mounted read-only if true, read-write otherwise (false or unspecified). Defaults to false.

recursiveReadOnly string

RecursiveReadOnly specifies whether read-only mounts should be handled recursively. If ReadOnly is false, this field has no meaning and must be unspecified. If ReadOnly is true, and this field is set to Disabled, the mount is not made recursively read-only. If this field is set to IfPossible, the mount is made recursively read-only, if it is supported by the container runtime. If this field is set to Enabled, the mount is made recursively read-only if it is supported by the container runtime, otherwise the pod will not be started and an error will be generated to indicate the reason. If this field is set to IfPossible or Enabled, MountPropagation must be set to None (or be unspecified, which defaults to None). If this field is not specified, it is treated as an equivalent of Disabled.

subPath string

Path within the volume from which the container's volume should be mounted. Defaults to "" (volume's root).

subPathExpr string

Expanded path within the volume from which the container's volume should be mounted. Behaves similarly to SubPath but environment variable references $(VAR_NAME) are expanded using the container's environment. Defaults to "" (volume's root). SubPathExpr and SubPath are mutually exclusive.
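A mount only works when its name matches a declared volume, so the sketch below pairs extraVolumeMounts with a matching extraVolumes entry. It assumes the standard Kubernetes emptyDir volume source is accepted by the passthrough; the volume name and mount path are hypothetical.

```yaml
# Fragment of an InferenceService manifest; other spec fields are omitted.
spec:
  extraVolumes:
    - name: kernel-cache            # hypothetical volume name
      emptyDir: {}                  # standard Kubernetes volume source (assumed to pass through)
  extraVolumeMounts:
    - name: kernel-cache            # must match the volume name above
      mountPath: /var/cache/kernels # hypothetical path; must not contain ':'
```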

extraVolumes []object

ExtraVolumes adds additional Volumes to the inference Pod, appended after the model-storage volumes built from ModelRef. Useful for a runtime-owned cache (e.g. a JIT kernel cache) that is unrelated to model weights and doesn't fit ModelCache's model-scoped PVC path. Pair with ExtraVolumeMounts to actually mount it into the container.

awsElasticBlockStore object

awsElasticBlockStore represents an AWS Disk resource that is attached to a kubelet's host machine and then exposed to the pod. Deprecated: AWSElasticBlockStore is deprecated. All operations for the in-tree awsElasticBlockStore type are redirected to the ebs.csi.aws.com CSI driver. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore

fsType string

fsType is the filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore

partition integer

partition is the partition in the volume that you want to mount. If omitted, the default is to mount by volume name. Examples: For volume /dev/sda1, you specify the partition as "1". Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).

format: int32

readOnly boolean

readOnly value true will force the readOnly setting in VolumeMounts. More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore

volumeID string required

volumeID is unique ID of the persistent disk resource in AWS (Amazon EBS volume). More info: https://kubernetes.io/docs/concepts/storage/volumes#awselasticblockstore

azureDisk object

azureDisk represents an Azure Data Disk mount on the host and bind mount to the pod. Deprecated: AzureDisk is deprecated. All operations for the in-tree azureDisk type are redirected to the disk.csi.azure.com CSI driver.

cachingMode string

cachingMode is the Host Caching mode: None, Read Only, Read Write.

diskName string required

diskName is the Name of the data disk in the blob storage

diskURI string required

diskURI is the URI of data disk in the blob storage

fsType string

fsType is Filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.

kind string

kind expected values are: Shared (multiple blob disks per storage account), Dedicated (single blob disk per storage account), Managed (Azure managed data disk, only in managed availability set). Defaults to Shared.

readOnly boolean

readOnly Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

azureFile object

azureFile represents an Azure File Service mount on the host and bind mount to the pod. Deprecated: AzureFile is deprecated. All operations for the in-tree azureFile type are redirected to the file.csi.azure.com CSI driver.

readOnly boolean

readOnly defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

secretName string required

secretName is the name of secret that contains Azure Storage Account Name and Key

shareName string required

shareName is the azure share Name

cephfs object

cephFS represents a Ceph FS mount on the host that shares a pod's lifetime. Deprecated: CephFS is deprecated and the in-tree cephfs type is no longer supported.

monitors []string required

monitors is required: a collection of Ceph monitors. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it

path string

path is optional: used as the mounted root, rather than the full Ceph tree. Defaults to /.

readOnly boolean

readOnly is optional: defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it

secretFile string

secretFile is optional: the path to the key ring for User. Defaults to /etc/ceph/user.secret. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it

secretRef object

secretRef is optional: a reference to the authentication secret for User. Defaults to empty. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it

name string

user string

user is optional: the rados user name. Defaults to admin. More info: https://examples.k8s.io/volumes/cephfs/README.md#how-to-use-it

cinder object

cinder represents a cinder volume attached and mounted on kubelets host machine. Deprecated: Cinder is deprecated. All operations for the in-tree cinder type are redirected to the cinder.csi.openstack.org CSI driver. More info: https://examples.k8s.io/mysql-cinder-pd/README.md

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://examples.k8s.io/mysql-cinder-pd/README.md

readOnly boolean

readOnly defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts. More info: https://examples.k8s.io/mysql-cinder-pd/README.md

secretRef object

secretRef is optional: points to a secret object containing parameters used to connect to OpenStack.

name string

volumeID string required

volumeID is used to identify the volume in cinder. More info: https://examples.k8s.io/mysql-cinder-pd/README.md

configMap object

configMap represents a configMap that should populate this volume

defaultMode integer

defaultMode is optional: mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Defaults to 0644. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.

format: int32

items []object

items if unspecified, each key-value pair in the Data field of the referenced ConfigMap will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the ConfigMap, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.

key string required

key is the key to project.

mode integer

mode is Optional: mode bits used to set permissions on this file. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. If not specified, the volume defaultMode will be used. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.

format: int32

path string required

path is the relative path of the file to map the key to. May not be an absolute path. May not contain the path element '..'. May not start with the string '..'.

name string

optional boolean

optional specifies whether the ConfigMap or its keys must be defined.
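As an illustration of how defaultMode, items, and optional combine, here is a minimal sketch of a single configMap volume entry. The ConfigMap name, key, and path are hypothetical, and the enclosing list field of the InferenceService spec is defined outside this section, so only the entry itself is shown.

```yaml
# One volume entry (hypothetical names throughout).
- name: server-config
  configMap:
    name: inference-server-config   # hypothetical ConfigMap in the same namespace
    defaultMode: 0440               # octal in YAML; JSON would need decimal 288
    items:
      - key: config.json            # only listed keys are projected
        path: conf/config.json      # relative path, no '..' element
    optional: false                 # volume setup errors if the ConfigMap or key is missing
```

Because items is set, any other keys in the ConfigMap would not appear in the volume.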

csi object

csi (Container Storage Interface) represents ephemeral storage that is handled by certain external CSI drivers.

driver string required

driver is the name of the CSI driver that handles this volume. Consult with your admin for the correct name as registered in the cluster.

fsType string

fsType to mount. Ex. "ext4", "xfs", "ntfs". If not provided, the empty value is passed to the associated CSI driver which will determine the default filesystem to apply.

nodePublishSecretRef object

nodePublishSecretRef is a reference to the secret object containing sensitive information to pass to the CSI driver to complete the CSI NodePublishVolume and NodeUnpublishVolume calls. This field is optional, and may be empty if no secret is required. If the secret object contains more than one secret, all secret references are passed.

name string

readOnly boolean

readOnly specifies a read-only configuration for the volume. Defaults to false (read/write).

volumeAttributes object

volumeAttributes stores driver-specific properties that are passed to the CSI driver. Consult your driver's documentation for supported values.
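A sketch of an inline CSI ephemeral volume entry may help show how these fields fit together. The driver name, attribute key, and secret name are assumptions for illustration; real values depend on the CSI driver registered in your cluster.

```yaml
# One volume entry using an inline CSI driver (all names hypothetical).
- name: scratch
  csi:
    driver: inline.storage.example.com   # hypothetical driver name; ask your cluster admin
    fsType: ext4                         # omit to let the driver pick its default
    readOnly: true
    volumeAttributes:                    # driver-specific string key/value pairs
      size: "1Gi"                        # hypothetical attribute
    nodePublishSecretRef:
      name: csi-credentials              # hypothetical Secret
```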

downwardAPI object

downwardAPI represents downward API information about the pod that should populate this volume.

defaultMode integer

Optional: mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Defaults to 0644. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.

format: int32

items []object

Items is a list of downward API volume files.

fieldRef object

Required: Selects a field of the pod: only annotations, labels, name, namespace and uid are supported.

apiVersion string

Version of the schema the FieldPath is written in terms of, defaults to "v1".

fieldPath string required

Path of the field to select in the specified API version.

mode integer

Optional: mode bits used to set permissions on this file, must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. If not specified, the volume defaultMode will be used. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.

format: int32

path string required

Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be utf-8 encoded. The first item of the relative path must not start with '..'

resourceFieldRef object

Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.

containerName string

Container name: required for volumes, optional for env vars

divisor string | integer

Specifies the output format of the exposed resources, defaults to "1"

string pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$

resource string required

Required: resource to select
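The fieldRef and resourceFieldRef selectors above can be sketched as one downwardAPI volume entry. The volume name, file paths, and container name are hypothetical; only the entry is shown, since its parent list field is not part of this section.

```yaml
# One volume entry exposing pod labels and a container memory limit as files.
- name: podinfo
  downwardAPI:
    defaultMode: 0644                # octal in YAML; decimal 420 in JSON
    items:
      - path: labels                 # file created at <mountPath>/labels
        fieldRef:
          fieldPath: metadata.labels
      - path: mem_limit_mib
        resourceFieldRef:
          containerName: inference   # hypothetical container name; required for volumes
          resource: limits.memory
          divisor: 1Mi               # report the limit in mebibytes
```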

emptyDir object

emptyDir represents a temporary directory that shares a pod's lifetime. More info: https://kubernetes.io/docs/concepts/storage/volumes#emptydir

medium string

medium represents what type of storage medium should back this directory. The default is "" which means to use the node's default medium. Must be an empty string (default) or Memory. More info: https://kubernetes.io/docs/concepts/storage/volumes#emptydir

sizeLimit string | integer

sizeLimit is the total amount of local storage required for this EmptyDir volume. The size limit is also applicable for memory medium. The maximum usage on memory medium EmptyDir would be the minimum value between the SizeLimit specified here and the sum of memory limits of all containers in a pod. The default is nil which means that the limit is undefined. More info: https://kubernetes.io/docs/concepts/storage/volumes#emptydir

string pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
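For instance, a memory-backed emptyDir with a size cap could be written as the following entry. The volume name and size are illustrative assumptions.

```yaml
# One volume entry: tmpfs-backed scratch space capped at 2Gi.
- name: cache
  emptyDir:
    medium: Memory    # leave unset (or "") to use the node's default medium
    sizeLimit: 2Gi    # also bounded by the pod's container memory limits
```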

ephemeral object

ephemeral represents a volume that is handled by a cluster storage driver. The volume's lifecycle is tied to the pod that defines it - it will be created before the pod starts, and deleted when the pod is removed. Use this if: a) the volume is only needed while the pod runs, b) features of normal volumes like restoring from snapshot or capacity tracking are needed, c) the storage driver is specified through a storage class, and d) the storage driver supports dynamic volume provisioning through a PersistentVolumeClaim (see EphemeralVolumeSource for more information on the connection between this volume type and PersistentVolumeClaim). Use PersistentVolumeClaim or one of the vendor-specific APIs for volumes that persist for longer than the lifecycle of an individual pod. Use CSI for light-weight local ephemeral volumes if the CSI driver is meant to be used that way - see the documentation of the driver for more information. A pod can use both types of ephemeral volumes and persistent volumes at the same time.

volumeClaimTemplate object

Will be used to create a stand-alone PVC to provision the volume. The pod in which this EphemeralVolumeSource is embedded will be the owner of the PVC, i.e. the PVC will be deleted together with the pod. The name of the PVC will be `<pod name>-<volume name>` where `<volume name>` is the name from the `PodSpec.Volumes` array entry. Pod validation will reject the pod if the concatenated name is not valid for a PVC (for example, too long). An existing PVC with that name that is not owned by the pod will *not* be used for the pod to avoid using an unrelated volume by mistake. Starting the pod is then blocked until the unrelated PVC is removed. If such a pre-created PVC is meant to be used by the pod, the PVC has to be updated with an owner reference to the pod once the pod exists. Normally this should not be necessary, but it may be useful when manually reconstructing a broken cluster. This field is read-only and no changes will be made by Kubernetes to the PVC after it has been created. Required, must not be nil.

metadata object

May contain labels and annotations that will be copied into the PVC when creating it. No other fields are allowed and will be rejected during validation.

spec object required

The specification for the PersistentVolumeClaim. The entire content is copied unchanged into the PVC that gets created from this template. The same fields as in a PersistentVolumeClaim are also valid here.

accessModes []string

accessModes contains the desired access modes the volume should have. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#access-modes-1

dataSource object

dataSource field can be used to specify either:

* An existing VolumeSnapshot object (snapshot.storage.k8s.io/VolumeSnapshot)
* An existing PVC (PersistentVolumeClaim)

If the provisioner or an external controller can support the specified data source, it will create a new volume based on the contents of the specified data source. When the AnyVolumeDataSource feature gate is enabled, dataSource contents will be copied to dataSourceRef, and dataSourceRef contents will be copied to dataSource when dataSourceRef.namespace is not specified. If the namespace is specified, then dataSourceRef will not be copied to dataSource.

apiGroup string

APIGroup is the group for the resource being referenced. If APIGroup is not specified, the specified Kind must be in the core API group. For any other third-party types, APIGroup is required.

kind string required

Kind is the type of resource being referenced

name string required

Name is the name of resource being referenced

dataSourceRef object

dataSourceRef specifies the object from which to populate the volume with data, if a non-empty volume is desired. This may be any object from a non-empty API group (non core object) or a PersistentVolumeClaim object. When this field is specified, volume binding will only succeed if the type of the specified object matches some installed volume populator or dynamic provisioner. This field will replace the functionality of the dataSource field and as such if both fields are non-empty, they must have the same value. For backwards compatibility, when namespace isn't specified in dataSourceRef, both fields (dataSource and dataSourceRef) will be set to the same value automatically if one of them is empty and the other is non-empty. When namespace is specified in dataSourceRef, dataSource isn't set to the same value and must be empty. There are three important differences between dataSource and dataSourceRef:

* While dataSource only allows two specific types of objects, dataSourceRef allows any non-core object, as well as PersistentVolumeClaim objects.
* While dataSource ignores disallowed values (dropping them), dataSourceRef preserves all values, and generates an error if a disallowed value is specified.
* While dataSource only allows local objects, dataSourceRef allows objects in any namespaces.

(Beta) Using this field requires the AnyVolumeDataSource feature gate to be enabled. (Alpha) Using the namespace field of dataSourceRef requires the CrossNamespaceVolumeDataSource feature gate to be enabled.

apiGroup string

APIGroup is the group for the resource being referenced. If APIGroup is not specified, the specified Kind must be in the core API group. For any other third-party types, APIGroup is required.

kind string required

Kind is the type of resource being referenced

name string required

Name is the name of resource being referenced

namespace string

Namespace is the namespace of the resource being referenced. Note that when a namespace is specified, a gateway.networking.k8s.io/ReferenceGrant object is required in the referent namespace to allow that namespace's owner to accept the reference. See the ReferenceGrant documentation for details. (Alpha) This field requires the CrossNamespaceVolumeDataSource feature gate to be enabled.

resources object

resources represents the minimum resources the volume should have. Users are allowed to specify resource requirements that are lower than previous value but must still be higher than capacity recorded in the status field of the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#resources

limits object

Limits describes the maximum amount of compute resources allowed. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

requests object

Requests describes the minimum amount of compute resources required. If Requests is omitted for a container, it defaults to Limits if that is explicitly specified, otherwise to an implementation-defined value. Requests cannot exceed Limits. More info: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

selector object

selector is a label query over volumes to consider for binding.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

storageClassName string

storageClassName is the name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1

volumeAttributesClassName string

volumeAttributesClassName may be used to set the VolumeAttributesClass used by this claim. If specified, the CSI driver will create or update the volume with the attributes defined in the corresponding VolumeAttributesClass. This has a different purpose than storageClassName: it can be changed after the claim is created. An empty string or nil value indicates that no VolumeAttributesClass will be applied to the claim. If the claim enters an Infeasible error state, this field can be reset to its previous value (including nil) to cancel the modification. If the resource referred to by volumeAttributesClass does not exist, this PersistentVolumeClaim will be set to a Pending state, as reflected by the modifyVolumeStatus field, until such a resource exists. More info: https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/

volumeMode string

volumeMode defines what type of volume is required by the claim. Value of Filesystem is implied when not included in claim spec.

volumeName string

volumeName is the binding reference to the PersistentVolume backing this claim.
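To make the volumeClaimTemplate fields concrete, the following is a minimal sketch of a generic ephemeral volume entry. The volume name, label, StorageClass name, and size are assumptions; the StorageClass must exist in your cluster and support dynamic provisioning.

```yaml
# One volume entry; the PVC is created with the pod and deleted with it.
- name: model-scratch
  ephemeral:
    volumeClaimTemplate:
      metadata:
        labels:
          app: example               # copied onto the generated PVC
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # hypothetical StorageClass
        volumeMode: Filesystem       # implied when omitted
        resources:
          requests:
            storage: 50Gi
```

Following the naming rule above, the generated PVC would be called `<pod name>-model-scratch`.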

fc object

fc represents a Fibre Channel resource that is attached to a kubelet's host machine and then exposed to the pod.

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.

lun integer

lun is Optional: FC target lun number

format: int32

readOnly boolean

readOnly is Optional: Defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

targetWWNs []string

targetWWNs is Optional: FC target worldwide names (WWNs)

wwids []string

wwids is Optional: FC volume world wide identifiers (wwids). Either wwids or a combination of targetWWNs and lun must be set, but not both simultaneously.

flexVolume object

flexVolume represents a generic volume resource that is provisioned/attached using an exec based plugin. Deprecated: FlexVolume is deprecated. Consider using a CSIDriver instead.

driver string required

driver is the name of the driver to use for this volume.

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". The default filesystem depends on FlexVolume script.

options object

options is Optional: this field holds extra command options if any.

readOnly boolean

readOnly is Optional: defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

secretRef object

secretRef is Optional: secretRef is reference to the secret object containing sensitive information to pass to the plugin scripts. This may be empty if no secret object is specified. If the secret object contains more than one secret, all secrets are passed to the plugin scripts.

name string

flocker object

flocker represents a Flocker volume attached to a kubelet's host machine. This depends on the Flocker control service running. Deprecated: Flocker is deprecated and the in-tree flocker type is no longer supported.

datasetName string

datasetName is the name of the dataset, stored as metadata -> name on the dataset for Flocker. It should be considered deprecated.

datasetUUID string

datasetUUID is the UUID of the dataset. This is the unique identifier of a Flocker dataset.

gcePersistentDisk object

gcePersistentDisk represents a GCE Disk resource that is attached to a kubelet's host machine and then exposed to the pod. Deprecated: GCEPersistentDisk is deprecated. All operations for the in-tree gcePersistentDisk type are redirected to the pd.csi.storage.gke.io CSI driver. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk

fsType string

fsType is the filesystem type of the volume that you want to mount. Tip: Ensure that the filesystem type is supported by the host operating system. Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk

partition integer

format: int32

pdName string required

pdName is the unique name of the PD resource in GCE. Used to identify the disk in GCE. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk

readOnly boolean

readOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false. More info: https://kubernetes.io/docs/concepts/storage/volumes#gcepersistentdisk

gitRepo object

gitRepo represents a git repository at a particular revision. Deprecated: GitRepo is deprecated. To provision a container with a git repo, mount an EmptyDir into an InitContainer that clones the repo using git, then mount the EmptyDir into the Pod's container.

directory string

directory is the target directory name. Must not contain or start with '..'. If '.' is supplied, the volume directory will be the git repository. Otherwise, if specified, the volume will contain the git repository in the subdirectory with the given name.

repository string required

repository is the URL of the git repository.

revision string

revision is the commit hash for the specified revision.

glusterfs object

glusterfs represents a Glusterfs mount on the host that shares a pod's lifetime. Deprecated: Glusterfs is deprecated and the in-tree glusterfs type is no longer supported.

endpoints string required

endpoints is the endpoint name that details Glusterfs topology.

path string required

path is the Glusterfs volume path. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod

readOnly boolean

readOnly here will force the Glusterfs volume to be mounted with read-only permissions. Defaults to false. More info: https://examples.k8s.io/volumes/glusterfs/README.md#create-a-pod

hostPath object

hostPath represents a pre-existing file or directory on the host machine that is directly exposed to the container. This is generally used for system agents or other privileged things that are allowed to see the host machine. Most containers will NOT need this. More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath

path string required

path of the directory on the host. If the path is a symlink, it will follow the link to the real path. More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath

type string

type for HostPath Volume. Defaults to "". More info: https://kubernetes.io/docs/concepts/storage/volumes#hostpath
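A hostPath entry might look like the sketch below. The host directory is a hypothetical example, and the `Directory` type (one of the standard Kubernetes HostPathType values) asks the kubelet to verify that the path already exists as a directory.

```yaml
# One volume entry exposing an existing host directory (path is hypothetical).
- name: host-models
  hostPath:
    path: /var/lib/models   # must already exist on the node when type is Directory
    type: Directory         # "" (the default) skips the existence check
```

As the description notes, most workloads do not need hostPath; it ties the pod to whatever happens to be on the node.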

image object

image represents an OCI object (a container image or artifact) pulled and mounted on the kubelet's host machine. The volume is resolved at pod startup depending on which PullPolicy value is provided:

- Always: the kubelet always attempts to pull the reference. Container creation will fail if the pull fails.
- Never: the kubelet never pulls the reference and only uses a local image or artifact. Container creation will fail if the reference isn't present.
- IfNotPresent: the kubelet pulls if the reference isn't already present on disk. Container creation will fail if the reference isn't present and the pull fails.

The volume gets re-resolved if the pod gets deleted and recreated, which means that new remote content will become available on pod recreation. A failure to resolve or pull the image during pod startup will block containers from starting and may add significant latency. Failures will be retried using normal volume backoff and will be reported on the pod reason and message. The types of objects that may be mounted by this volume are defined by the container runtime implementation on a host machine and at minimum must include all valid types supported by the container image field. The OCI object gets mounted in a single directory (spec.containers[*].volumeMounts.mountPath) by merging the manifest layers in the same way as for container images. The volume will be mounted read-only (ro). Sub path mounts for containers are not supported (spec.containers[*].volumeMounts.subpath) before 1.33. The field spec.securityContext.fsGroupChangePolicy has no effect on this volume type.

pullPolicy string

Policy for pulling OCI objects. Possible values are:

- Always: the kubelet always attempts to pull the reference. Container creation will fail if the pull fails.
- Never: the kubelet never pulls the reference and only uses a local image or artifact. Container creation will fail if the reference isn't present.
- IfNotPresent: the kubelet pulls if the reference isn't already present on disk. Container creation will fail if the reference isn't present and the pull fails.

Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.

reference string

Required: Image or artifact reference to be used. Behaves in the same way as pod.spec.containers[*].image. Pull secrets will be assembled in the same way as for the container image by looking up node credentials, SA image pull secrets, and pod spec image pull secrets. This field is optional to allow higher level config management to default or override container images in workload controllers like Deployments and StatefulSets. More info: https://kubernetes.io/docs/concepts/containers/images
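As a sketch of the image volume source, the entry below mounts an OCI artifact read-only. The registry, repository, and tag are hypothetical; whether a given artifact type can be mounted depends on the container runtime, as described above.

```yaml
# One volume entry backed by an OCI image or artifact (reference is hypothetical).
- name: model-weights
  image:
    reference: registry.example.com/models/example:v1
    pullPolicy: IfNotPresent   # with a :latest tag the default would be Always
```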

iscsi object

iscsi represents an ISCSI Disk resource that is attached to a kubelet's host machine and then exposed to the pod. More info: https://kubernetes.io/docs/concepts/storage/volumes/#iscsi

chapAuthDiscovery boolean

chapAuthDiscovery defines whether to support iSCSI Discovery CHAP authentication.

chapAuthSession boolean

chapAuthSession defines whether to support iSCSI Session CHAP authentication.

fsType string

initiatorName string

initiatorName is the custom iSCSI Initiator Name. If initiatorName is specified with iscsiInterface simultaneously, new iSCSI interface <target portal>:<volume name> will be created for the connection.

iqn string required

iqn is the target iSCSI Qualified Name.

iscsiInterface string

iscsiInterface is the interface Name that uses an iSCSI transport. Defaults to 'default' (tcp).

lun integer required

lun represents iSCSI Target Lun number.

format: int32

portals []string

portals is the iSCSI Target Portal List. The portal is either an IP or ip_addr:port if the port is other than default (typically TCP ports 860 and 3260).

readOnly boolean

readOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false.

secretRef object

secretRef is the CHAP Secret for iSCSI target and initiator authentication

name string

targetPortal string required

targetPortal is iSCSI Target Portal. The Portal is either an IP or ip_addr:port if the port is other than default (typically TCP ports 860 and 3260).

name string required

name of the volume. Must be a DNS_LABEL and unique within the pod. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

nfs object

nfs represents an NFS mount on the host that shares a pod's lifetime. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs

path string required

path that is exported by the NFS server. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs

readOnly boolean

readOnly here will force the NFS export to be mounted with read-only permissions. Defaults to false. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs

server string required

server is the hostname or IP address of the NFS server. More info: https://kubernetes.io/docs/concepts/storage/volumes#nfs
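An NFS-backed entry, for example for a shared model directory, could be sketched as follows. The server hostname and export path are assumptions for illustration.

```yaml
# One volume entry mounting an NFS export read-only (server and path hypothetical).
- name: shared-models
  nfs:
    server: nfs.example.internal   # hostname or IP address of the NFS server
    path: /exports/models          # path exported by that server
    readOnly: true
```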

persistentVolumeClaim object

persistentVolumeClaimVolumeSource represents a reference to a PersistentVolumeClaim in the same namespace. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims

claimName string required

claimName is the name of a PersistentVolumeClaim in the same namespace as the pod using this volume. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#persistentvolumeclaims

readOnly boolean

readOnly will force the ReadOnly setting in VolumeMounts. Defaults to false.
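Referencing a pre-existing claim takes only two fields. In this sketch the claim name is hypothetical, and the PersistentVolumeClaim must live in the same namespace as the pod.

```yaml
# One volume entry referencing an existing PVC (claim name is hypothetical).
- name: model-store
  persistentVolumeClaim:
    claimName: model-store-pvc
    readOnly: true   # forces read-only in every VolumeMount of this volume
```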

photonPersistentDisk object

photonPersistentDisk represents a PhotonController persistent disk attached and mounted on the kubelet's host machine. Deprecated: PhotonPersistentDisk is deprecated and the in-tree photonPersistentDisk type is no longer supported.

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.

pdID string required

pdID is the ID that identifies the Photon Controller persistent disk.

portworxVolume object

portworxVolume represents a portworx volume attached and mounted on the kubelet's host machine. Deprecated: PortworxVolume is deprecated. All operations for the in-tree portworxVolume type are redirected to the pxd.portworx.com CSI driver.

fsType string

fsType represents the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs". Implicitly inferred to be "ext4" if unspecified.

readOnly boolean

readOnly defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

volumeID string required

volumeID uniquely identifies a Portworx volume

projected object

projected represents items for all-in-one resources: secrets, configMaps, and downward API.

defaultMode integer

defaultMode are the mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.

format: int32

sources []object

sources is the list of volume projections. Each entry in this list handles one source.

clusterTrustBundle object

ClusterTrustBundle allows a pod to access the `.spec.trustBundle` field of ClusterTrustBundle objects in an auto-updating file. Alpha, gated by the ClusterTrustBundleProjection feature gate. ClusterTrustBundle objects can either be selected by name, or by the combination of signer name and a label selector. Kubelet performs aggressive normalization of the PEM contents written into the pod filesystem. Esoteric PEM features such as inter-block comments and block headers are stripped. Certificates are deduplicated. The ordering of certificates within the file is arbitrary, and Kubelet may change the order over time.

labelSelector object

Select all ClusterTrustBundles that match this label selector. Only has effect if signerName is set. Mutually-exclusive with name. If unset, interpreted as "match nothing". If set but empty, interpreted as "match everything".

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

name string

Select a single ClusterTrustBundle by object name. Mutually-exclusive with signerName and labelSelector.

optional boolean

If true, don't block pod startup if the referenced ClusterTrustBundle(s) aren't available. If using name, then the named ClusterTrustBundle is allowed not to exist. If using signerName, then the combination of signerName and labelSelector is allowed to match zero ClusterTrustBundles.

path string required

Relative path from the volume root to write the bundle.

signerName string

Select all ClusterTrustBundles that match this signer name. Mutually-exclusive with name. The contents of all selected ClusterTrustBundles will be unified and deduplicated.
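The two selection modes (by name, or by signerName plus labelSelector) are mutually exclusive. A sketch of the signer-based form is shown below; the signer name and label are hypothetical, and the source is alpha and gated by the ClusterTrustBundleProjection feature gate as noted above.

```yaml
# One projected volume entry with a single clusterTrustBundle source.
- name: trust-anchors
  projected:
    sources:
      - clusterTrustBundle:
          signerName: example.com/internal-ca   # hypothetical signer
          labelSelector:                        # only has effect because signerName is set
            matchLabels:
              env: prod                         # hypothetical label
          path: ca-bundle.pem                   # relative to the volume root
          optional: true                        # pod may start even if nothing matches
```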

configMap object

configMap information about the configMap data to project

items []object

key string required

key is the key to project.

mode integer

format: int32

path string required

path is the relative path of the file to map the key to. May not be an absolute path. May not contain the path element '..'. May not start with the string '..'.

name string

optional boolean

optional specifies whether the ConfigMap or its keys must be defined.

downwardAPI object

downwardAPI information about the downwardAPI data to project

items []object

Items is a list of DownwardAPIVolume files.

fieldRef object

Required: Selects a field of the pod: only annotations, labels, name, namespace and uid are supported.

apiVersion string

Version of the schema the FieldPath is written in terms of, defaults to "v1".

fieldPath string required

Path of the field to select in the specified API version.

mode integer

format: int32

path string required

Required: Path is the relative path name of the file to be created. Must not be absolute or contain the '..' path. Must be utf-8 encoded. The first item of the relative path must not start with '..'

resourceFieldRef object

Selects a resource of the container: only resources limits and requests (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.

containerName string

Container name: required for volumes, optional for env vars

divisor string | integer

Specifies the output format of the exposed resources, defaults to "1"

string pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$

resource string required

Required: resource to select
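Combining the configMap and downwardAPI sources described so far, a projected volume entry might be sketched like this. The ConfigMap name, keys, and paths are hypothetical, and only the volume entry is shown.

```yaml
# One projected volume entry merging two sources into a single directory.
- name: combined
  projected:
    defaultMode: 0440                     # octal in YAML; decimal 288 in JSON
    sources:
      - configMap:
          name: inference-server-config   # hypothetical ConfigMap
          items:
            - key: config.json
              path: config/config.json
      - downwardAPI:
          items:
            - path: pod/labels
              fieldRef:
                fieldPath: metadata.labels
```

Each entry under sources handles exactly one source, so the two projections are listed as separate items rather than merged into one.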

podCertificate object

Projects an auto-rotating credential bundle (private key and certificate chain) that the pod can use either as a TLS client or server. Kubelet generates a private key and uses it to send a PodCertificateRequest to the named signer. Once the signer approves the request and issues a certificate chain, Kubelet writes the key and certificate chain to the pod filesystem. The pod does not start until certificates have been issued for each podCertificate projected volume source in its spec. Kubelet will begin trying to rotate the certificate at the time indicated by the signer using the PodCertificateRequest.Status.BeginRefreshAt timestamp. Kubelet can write a single file, indicated by the credentialBundlePath field, or separate files, indicated by the keyPath and certificateChainPath fields. The credential bundle is a single file in PEM format. The first PEM entry is the private key (in PKCS#8 format), and the remaining PEM entries are the certificate chain issued by the signer (typically, signers will return their certificate chain in leaf-to-root order). Prefer using the credential bundle format, since your application code can read it atomically. If you use keyPath and certificateChainPath, your application must make two separate file reads. If these coincide with a certificate rotation, it is possible that the private key and leaf certificate you read may not correspond to each other. Your application will need to check for this condition, and re-read until they are consistent. The named signer chooses the format of the certificate it issues; consult the signer implementation's documentation to learn how to use the certificates it issues.

certificateChainPath string

Write the certificate chain at this path in the projected volume. Most applications should use credentialBundlePath. When using keyPath and certificateChainPath, your application needs to check that the key and leaf certificate are consistent, because it is possible to read the files mid-rotation.

credentialBundlePath string

Write the credential bundle at this path in the projected volume. The credential bundle is a single file that contains multiple PEM blocks. The first PEM block is a PRIVATE KEY block, containing a PKCS#8 private key. The remaining blocks are CERTIFICATE blocks, containing the issued certificate chain from the signer (leaf and any intermediates). Using credentialBundlePath lets your Pod's application code make a single atomic read that retrieves a consistent key and certificate chain. If you project them to separate files, your application code will need to additionally check that the leaf certificate was issued to the key.
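An application reading the bundle has to separate the leading PRIVATE KEY block from the CERTIFICATE blocks that follow. The sketch below shows one way to do that with the standard library only; the function name `split_credential_bundle` is hypothetical, and it checks block order and labels only, not that the key matches the leaf certificate.

```python
import re

# One PEM block: a BEGIN line, the body, and the END line with the same label.
PEM_BLOCK = re.compile(
    r"-----BEGIN (?P<label>[A-Z0-9 ]+)-----\r?\n.*?-----END (?P=label)-----",
    re.DOTALL,
)


def split_credential_bundle(bundle: str):
    """Return (private_key_pem, [certificate_pem, ...]) from one bundle read."""
    blocks = [(m.group("label"), m.group(0)) for m in PEM_BLOCK.finditer(bundle)]
    if not blocks or blocks[0][0] != "PRIVATE KEY":
        raise ValueError("first PEM block must be a PRIVATE KEY block")
    chain = blocks[1:]
    if any(label != "CERTIFICATE" for label, _ in chain):
        raise ValueError("remaining PEM blocks must be CERTIFICATE blocks")
    return blocks[0][1], [pem for _, pem in chain]
```

Because the whole file is read once and split in memory, the key and chain returned always come from the same rotation.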

keyPath string

Write the key at this path in the projected volume. Most applications should use credentialBundlePath. When using keyPath and certificateChainPath, your application needs to check that the key and leaf certificate are consistent, because it is possible to read the files mid-rotation.

keyType string required

The type of keypair Kubelet will generate for the pod. Valid values are "RSA3072", "RSA4096", "ECDSAP256", "ECDSAP384", "ECDSAP521", and "ED25519".

maxExpirationSeconds integer

maxExpirationSeconds is the maximum lifetime permitted for the certificate. Kubelet copies this value verbatim into the PodCertificateRequests it generates for this projection. If omitted, kube-apiserver will set it to 86400 (24 hours). kube-apiserver will reject values shorter than 3600 (1 hour). The maximum allowable value is 7862400 (91 days). The signer implementation is then free to issue a certificate with any lifetime *shorter* than MaxExpirationSeconds, but no shorter than 3600 seconds (1 hour). This constraint is enforced by kube-apiserver. `kubernetes.io` signers will never issue certificates with a lifetime longer than 24 hours.

format: int32
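The default and bounds stated for maxExpirationSeconds can be expressed as a small check. This is an illustrative sketch of the documented rules, not kube-apiserver code; the function and constant names are hypothetical.

```python
DEFAULT_MAX_EXPIRATION = 86400   # 24 hours, applied when the field is omitted
LOWER_BOUND = 3600               # 1 hour
UPPER_BOUND = 7862400            # 91 days


def resolve_max_expiration_seconds(value=None) -> int:
    """Apply the documented default, or reject values outside the bounds."""
    if value is None:
        return DEFAULT_MAX_EXPIRATION
    if not LOWER_BOUND <= value <= UPPER_BOUND:
        raise ValueError(
            f"maxExpirationSeconds must be between {LOWER_BOUND} and {UPPER_BOUND}"
        )
    return value
```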

signerName string required

Kubelet's generated CSRs will be addressed to this signer.

userAnnotations object

userAnnotations allow pod authors to pass additional information to the signer implementation. Kubernetes does not restrict or validate this metadata in any way. These values are copied verbatim into the `spec.unverifiedUserAnnotations` field of the PodCertificateRequest objects that Kubelet creates. Entries are subject to the same validation as object metadata annotations, with the addition that all keys must be domain-prefixed. No restrictions are placed on values, except an overall size limitation on the entire field. Signers should document the keys and values they support. Signers should deny requests that contain keys they do not recognize.

secret object

secret information about the secret data to project

items []object

items if unspecified, each key-value pair in the Data field of the referenced Secret will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the Secret, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.

key string required

key is the key to project.

mode integer

format: int32

path string required

path is the relative path of the file to map the key to. May not be an absolute path. May not contain the path element '..'. May not start with the string '..'.

name string

optional boolean

optional field specifies whether the Secret or its key must be defined.

serviceAccountToken object

serviceAccountToken is information about the serviceAccountToken data to project

audience string

audience is the intended audience of the token. A recipient of a token must identify itself with an identifier specified in the audience of the token, and otherwise should reject the token. The audience defaults to the identifier of the apiserver.

expirationSeconds integer

expirationSeconds is the requested duration of validity of the service account token. As the token approaches expiration, the kubelet volume plugin will proactively rotate the service account token. The kubelet will start trying to rotate the token if the token is older than 80 percent of its time to live or if the token is older than 24 hours. Defaults to 1 hour and must be at least 10 minutes.

format: int64
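The rotation rule for expirationSeconds (80 percent of the time to live, or 24 hours, whichever comes first) works out as follows. This is a sketch of the arithmetic only; the function name `rotation_age_seconds` is hypothetical and integer division is used to avoid floating-point rounding.

```python
DAY_SECONDS = 24 * 3600


def rotation_age_seconds(expiration_seconds: int = 3600) -> int:
    """Token age (in seconds) at which rotation attempts would begin."""
    if expiration_seconds < 600:
        raise ValueError("expirationSeconds must be at least 10 minutes")
    eighty_percent = expiration_seconds * 80 // 100
    return min(eighty_percent, DAY_SECONDS)
```

With the default of 1 hour, rotation attempts start once the token is 2880 seconds (48 minutes) old; for any lifetime above 30 hours the 24-hour cap applies instead.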

path string required

path is the path relative to the mount point of the file to project the token into.

quobyte object

quobyte represents a Quobyte mount on the host that shares a pod's lifetime. Deprecated: Quobyte is deprecated and the in-tree quobyte type is no longer supported.

group string

group to map volume access to. Default is no group.

readOnly boolean

readOnly here will force the Quobyte volume to be mounted with read-only permissions. Defaults to false.

registry string required

registry represents one or more Quobyte Registry services, specified as a string of host:port pairs (multiple entries are separated by commas), which act as the central registry for volumes.

tenant string

tenant owning the given Quobyte volume in the Backend. Used with dynamically provisioned Quobyte volumes; the value is set by the plugin.

user string

user to map volume access to. Defaults to the serviceaccount user.

volume string required

volume is a string that references an already created Quobyte volume by name.

rbd object

rbd represents a Rados Block Device mount on the host that shares a pod's lifetime. Deprecated: RBD is deprecated and the in-tree rbd type is no longer supported.

fsType string

image string required

image is the rados image name. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

keyring string

keyring is the path to key ring for RBDUser. Default is /etc/ceph/keyring. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

monitors []string required

monitors is a collection of Ceph monitors. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

pool string

pool is the rados pool name. Default is rbd. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

readOnly boolean

readOnly here will force the ReadOnly setting in VolumeMounts. Defaults to false. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

secretRef object

secretRef is the name of the authentication secret for RBDUser. If provided, it overrides keyring. Default is nil. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

name string

user string

user is the rados user name. Default is admin. More info: https://examples.k8s.io/volumes/rbd/README.md#how-to-use-it

scaleIO object

scaleIO represents a ScaleIO persistent volume attached and mounted on Kubernetes nodes. Deprecated: ScaleIO is deprecated and the in-tree scaleIO type is no longer supported.

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Default is "xfs".

gateway string required

gateway is the host address of the ScaleIO API Gateway.

protectionDomain string

protectionDomain is the name of the ScaleIO Protection Domain for the configured storage.

readOnly boolean

readOnly defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

secretRef object required

secretRef references the secret for the ScaleIO user and other sensitive information. If this is not provided, the Login operation will fail.

name string

sslEnabled boolean

sslEnabled is a flag that enables or disables SSL communication with the Gateway. Defaults to false.

storageMode string

storageMode indicates whether the storage for a volume should be ThickProvisioned or ThinProvisioned. Default is ThinProvisioned.

storagePool string

storagePool is the ScaleIO Storage Pool associated with the protection domain.

system string required

system is the name of the storage system as configured in ScaleIO.

volumeName string

volumeName is the name of a volume already created in the ScaleIO system that is associated with this volume source.

secret object

secret represents a secret that should populate this volume. More info: https://kubernetes.io/docs/concepts/storage/volumes#secret

defaultMode integer

defaultMode is Optional: mode bits used to set permissions on created files by default. Must be an octal value between 0000 and 0777 or a decimal value between 0 and 511. YAML accepts both octal and decimal values, JSON requires decimal values for mode bits. Defaults to 0644. Directories within the path are not affected by this setting. This might be in conflict with other options that affect the file mode, like fsGroup, and the result can be other mode bits set.

format: int32
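Because JSON requires decimal mode bits, an octal mode such as 0644 has to be converted before it is written into a JSON manifest. A minimal sketch of that conversion, with a hypothetical helper name `mode_for_json`:

```python
def mode_for_json(octal_text: str) -> int:
    """Convert an octal mode string such as "0644" to its decimal value."""
    mode = int(octal_text, 8)
    if not 0 <= mode <= 0o777:
        raise ValueError("mode must be between 0000 and 0777 (decimal 0 to 511)")
    return mode
```

The default 0644 becomes 420, and the upper bound 0777 becomes 511.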

items []object

items If unspecified, each key-value pair in the Data field of the referenced Secret will be projected into the volume as a file whose name is the key and content is the value. If specified, the listed keys will be projected into the specified paths, and unlisted keys will not be present. If a key is specified which is not present in the Secret, the volume setup will error unless it is marked optional. Paths must be relative and may not contain the '..' path or start with '..'.

key string required

key is the key to project.

mode integer

format: int32

path string required

path is the relative path of the file to map the key to. May not be an absolute path. May not contain the path element '..'. May not start with the string '..'.

optional boolean

optional field specifies whether the Secret or its keys must be defined.

secretName string

secretName is the name of the secret in the pod's namespace to use. More info: https://kubernetes.io/docs/concepts/storage/volumes#secret

storageos object

storageOS represents a StorageOS volume attached and mounted on Kubernetes nodes. Deprecated: StorageOS is deprecated and the in-tree storageos type is no longer supported.

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.

readOnly boolean

readOnly defaults to false (read/write). ReadOnly here will force the ReadOnly setting in VolumeMounts.

secretRef object

secretRef specifies the secret to use for obtaining the StorageOS API credentials. If not specified, default values will be attempted.

name string

volumeName string

volumeName is the human-readable name of the StorageOS volume. Volume names are only unique within a namespace.

volumeNamespace string

volumeNamespace specifies the scope of the volume within StorageOS. If no namespace is specified then the Pod's namespace will be used. This allows the Kubernetes name scoping to be mirrored within StorageOS for tighter integration. Set VolumeName to any name to override the default behaviour. Set to "default" if you are not using namespaces within StorageOS. Namespaces that do not pre-exist within StorageOS will be created.

vsphereVolume object

vsphereVolume represents a vSphere volume attached and mounted on the kubelet's host machine. Deprecated: VsphereVolume is deprecated. All operations for the in-tree vsphereVolume type are redirected to the csi.vsphere.vmware.com CSI driver.

fsType string

fsType is the filesystem type to mount. Must be a filesystem type supported by the host operating system. Ex. "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.

storagePolicyID string

storagePolicyID is the storage Policy Based Management (SPBM) profile ID associated with the StoragePolicyName.

storagePolicyName string

storagePolicyName is the storage Policy Based Management (SPBM) profile name.

volumePath string required

volumePath is the path that identifies vSphere volume vmdk

flashAttention boolean

FlashAttention enables flash attention for faster prompt processing and reduced KV cache memory. Maps to llama.cpp --flash-attn flag. On NVIDIA GPUs requires Ampere or newer (compute capability 8.0+). On Apple Silicon (Metal agent path) the default is true when this field is unset, because the wired-collector + flash-attn combination prevents the ~25% decode degradation observed at long context on Qwen-class models running on M-series chips.

hotCacheMaxSize string

HotCacheMaxSize sets the maximum size of the oMLX hot cache. Maps to oMLX --hot-cache-max-size. The hot cache holds recently used KV cache blocks in RAM for fast access. A string value like "100GB" or "50GB". Only meaningful for the omlx runtime; ignored by llamacpp and other runtimes.

image string

Image is the container image for the inference runtime. For llamacpp runtime, defaults to ghcr.io/ggml-org/llama.cpp:server. For generic runtime, this field is required.

imagePullSecrets []object

ImagePullSecrets for pulling container images from private registries.

name string

jinja boolean

Jinja enables Jinja2 chat template rendering for tool/function calling support. Required when using the OpenAI-compatible API with tools. Maps to llama.cpp --jinja flag.

maxPodLifetimeSeconds integer

MaxPodLifetimeSeconds sets the maximum lifetime (in seconds) for inference pods. When set, the operator copies this value to PodSpec.ActiveDeadlineSeconds on the generated Deployment's pod template, causing Kubernetes to terminate the pod after the specified duration even if it remains healthy. This is useful for workloads that need periodic process recycling to release driver memory (e.g. llama.cpp on AMD Vulkan with pinned GTT memory). When omitted, pods run indefinitely until manually restarted or the Deployment is updated.

format: int64

minimum: 1

metadataOverrides []string

MetadataOverrides overrides GGUF metadata key-value pairs at model load time. Each entry is passed as a separate --override-kv flag. Format: key=type:value (e.g., "qwen35moe.context_length=int:1048576" to extend context window, or "tokenizer.chat_template.thinking=bool:false" to tweak tokenizer behavior). Maps to llama.cpp --override-kv flag (one flag per entry).
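The one-flag-per-entry mapping can be sketched as below. How the operator actually renders its argument list is not shown in this reference, so the function name `override_kv_args` and the two-token form (`--override-kv` followed by the entry) are assumptions; only the key=type:value shape is checked.

```python
def override_kv_args(metadata_overrides):
    """Expand each key=type:value entry into its own --override-kv flag."""
    args = []
    for entry in metadata_overrides:
        key, equals, typed_value = entry.partition("=")
        value_type, colon, _value = typed_value.partition(":")
        if not (key and equals and value_type and colon):
            raise ValueError(f"expected key=type:value, got {entry!r}")
        args += ["--override-kv", entry]
    return args
```

Given the two example entries above, the result contains two `--override-kv` flags, each followed by its unchanged entry.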

mode string

Mode selects how the model is served: "chat" (default) for chat/completion, "embedding" for /v1/embeddings, "rerank" for /v1/rerank. For the llamacpp runtime it auto-appends the required flags (embedding -> --embedding --pooling last; rerank -> --reranking --embedding --pooling rank); any flag already set in spec.extraArgs wins. When unset, the mode is inferred from spec.extraArgs / spec.endpoint.path. The resolved value is always reported in status.mode.

enum: chat, embedding, rerank
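The auto-append behaviour for the llamacpp runtime can be modelled as a lookup plus a precedence check. The operator's real merge logic is not shown in this reference; treating spec.extraArgs as a list of separate tokens, the function name `llamacpp_mode_args`, and the output ordering are all assumptions made for illustration.

```python
# Flags described above for each mode; a value of None means a bare switch.
AUTO_FLAGS = {
    "chat": [],
    "embedding": [("--embedding", None), ("--pooling", "last")],
    "rerank": [("--reranking", None), ("--embedding", None), ("--pooling", "rank")],
}


def llamacpp_mode_args(mode, extra_args):
    """Append the mode's flags unless the same flag is already in extra_args."""
    args = list(extra_args)
    for flag, value in AUTO_FLAGS[mode]:
        if flag in extra_args:
            continue  # a flag already set in extraArgs wins
        args.append(flag)
        if value is not None:
            args.append(value)
    return args
```

For instance, with mode "rerank" and extraArgs of `--pooling mean`, the user's pooling choice is kept and only `--reranking` and `--embedding` are appended.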

modelCache object

ModelCache overrides where this InferenceService caches model weights: when claimName is set, the named user-owned PVC is mounted as the writable model cache (prep + download init containers run against it) instead of the operator's shared/perService cache PVC. When unset, the operator-global cache mode applies unchanged.

claimName string

ClaimName names a pre-existing PersistentVolumeClaim in the InferenceService's namespace to use as the writable model cache volume. Weights land under the usual <cacheKey>/ subdirectory of the claim, so RefreshPolicy and cache-key semantics are unchanged and multiple models can share one claim without colliding. The claim must already exist: when it is missing the InferenceService is marked Degraded rather than silently falling back to the shared cache. Ignored for pvc:// model sources (already staged, read-only, no download). Node alignment of RWO/local claims (via nodeSelector) is the user's responsibility.

minLength: 1

maxLength: 253

modelRef string required

ModelRef references the Model CR that contains the model to serve

moeCPULayers integer

MoeCPULayers sets the number of MoE layers to offload to CPU. When set, only the specified number of MoE layers run on CPU rather than all. Maps to llama.cpp --n-cpu-moe flag.

format: int32

minimum: 0

moeCPUOffload boolean

MoeCPUOffload offloads all MoE expert layers to CPU for reduced VRAM usage. Enables running large MoE models (e.g., Qwen3-30B, Mixtral) on VRAM-constrained hardware by keeping attention layers on GPU while expert weights use system RAM. Maps to llama.cpp --cpu-moe flag. Requires sufficient system RAM via resources.memory.

noKvOffload boolean

NoKvOffload keeps the KV cache in system RAM instead of VRAM. Useful for extended context windows when VRAM is constrained by model weights. Maps to llama.cpp --no-kv-offload flag. Requires sufficient system RAM via resources.memory.

noWarmup boolean

NoWarmup skips the llama.cpp startup warmup inference pass. Reduces pod ready time at the cost of slightly higher first-request latency. Useful for scale-to-zero and quick redeployment patterns. Maps to llama.cpp --no-warmup flag.

nodeSelector object

NodeSelector for pod placement (e.g., specific node pools)

pagedSSDCacheDir string

PagedSSDCacheDir sets the directory for the oMLX paged SSD cache. Maps to oMLX --paged-ssd-cache-dir. When set, the oMLX daemon uses a paged cache backed by the specified directory, allowing models to exceed available RAM by paging KV cache blocks to SSD. The directory must exist and be writable by the oMLX process. Only meaningful for the omlx runtime; ignored by llamacpp and other runtimes.

pagedSSDCacheMaxSize string

PagedSSDCacheMaxSize sets the maximum size of the oMLX paged SSD cache. Maps to oMLX --paged-ssd-cache-max-size. The paged cache holds KV cache blocks that have been evicted from RAM to SSD. A string value like "200GB" or "500GB". Only meaningful for the omlx runtime; ignored by llamacpp and other runtimes.

parallelSlots integer

ParallelSlots sets the number of concurrent request slots for the llama.cpp server (--parallel flag). Each slot processes one request independently; higher values use more KV cache memory. If not specified, the operator omits --parallel and llama.cpp picks an auto value (currently 4).

format: int32

minimum: 1

maximum: 64

personaPlexConfig object

PersonaPlexConfig holds configuration for the PersonaPlex (Moshi) runtime. Only used when Runtime is "personaplex".

cpuOffload boolean

CPUOffload enables model weight offloading to system RAM when GPU VRAM is insufficient. Requires the accelerate package in the container image.

hfTokenSecretRef object

HFTokenSecretRef references a Secret containing the HuggingFace token for model download. The Secret key must be "HF_TOKEN".

key string required

The key of the secret to select from. Must be a valid secret key.

name string

optional boolean

Specify whether the Secret or its key must be defined

quantize4Bit boolean

Quantize4Bit enables NF4 4-bit quantization for reduced VRAM usage (~9.6 GB vs ~14 GB). Requires the bitsandbytes package in the container image.

podAnnotations object

PodAnnotations are merged into the inference Pod's metadata.annotations. Use this to tag Pods for downstream tooling (cost attribution, service mesh routing, custom admission controllers) without those tools needing to know about LLMKube's CRD schema. Pure passthrough; the operator itself does not set any annotations on inference Pods today.

podLabels object

PodLabels are merged into the inference Pod's metadata.labels alongside the operator-managed labels (`app`, `inference.llmkube.dev/model`, `inference.llmkube.dev/service`). Operator-managed keys take precedence on collision so the Deployment selector stays in sync with the Pods it owns. The Deployment selector itself uses only the operator-managed labels and is immutable, so changing PodLabels later is safe.
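The collision rule amounts to a dictionary merge in which operator-managed keys are applied last. A minimal sketch, assuming hypothetical label values and a hypothetical function name `merged_pod_labels`:

```python
def merged_pod_labels(pod_labels, operator_labels):
    """Merge user labels with operator labels; operator keys win on collision."""
    return {**pod_labels, **operator_labels}


# Hypothetical values for illustration only.
user = {"team": "ml-platform", "app": "my-custom-name"}
operator = {
    "app": "example",
    "inference.llmkube.dev/model": "example-model",
    "inference.llmkube.dev/service": "example",
}
labels = merged_pod_labels(user, operator)
```

Here `labels["app"]` is the operator's value and `labels["team"]` is preserved from PodLabels.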

podSecurityContext object

PodSecurityContext defines pod-level security attributes for inference pods. Use this to set fsGroup for volume permissions (required on OpenShift).

appArmorProfile object

appArmorProfile is the AppArmor options to use by the containers in this pod. Note that this field cannot be set when spec.os.name is windows.

localhostProfile string

localhostProfile indicates a profile loaded on the node that should be used. The profile must be preconfigured on the node to work. Must match the loaded name of the profile. Must be set if and only if type is "Localhost".

type string required

type indicates which kind of AppArmor profile will be applied. Valid options are: Localhost - a profile pre-loaded on the node. RuntimeDefault - the container runtime's default profile. Unconfined - no AppArmor enforcement.

fsGroup integer

A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:

1. The owning GID will be the FSGroup
2. The setgid bit is set (new files created in the volume will be owned by FSGroup)
3. The permission bits are OR'd with rw-rw----

If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.

format: int64
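The effect of OR-ing permission bits with rw-rw---- (octal 0660) can be worked through directly. This sketch covers only that one rule; it does not model the ownership change or the setgid bit, and the function name `mode_after_fsgroup` is hypothetical.

```python
RW_RW = 0o660  # rw-rw----


def mode_after_fsgroup(mode: int) -> int:
    """Permission bits after being OR'd with rw-rw----."""
    return mode | RW_RW
```

A file that starts as 0644 (rw-r--r--) ends up as 0664 (rw-rw-r--), and one that starts as 0600 ends up as 0660.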

fsGroupChangePolicy string

fsGroupChangePolicy defines behavior of changing ownership and permission of the volume before being exposed inside Pod. This field will only apply to volume types which support fsGroup based ownership (and permissions). It will have no effect on ephemeral volume types such as: secret, configmaps and emptydir. Valid values are "OnRootMismatch" and "Always". If not specified, "Always" is used. Note that this field cannot be set when spec.os.name is windows.

runAsGroup integer

The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.

format: int64

runAsNonRoot boolean

Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

runAsUser integer

The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.

format: int64

seLinuxChangePolicy string

seLinuxChangePolicy defines how the container's SELinux label is applied to all volumes used by the Pod. It has no effect on nodes that do not support SELinux or on volumes that do not support SELinux. Valid values are "MountOption" and "Recursive". "Recursive" means relabeling of all files on all Pod volumes by the container runtime. This may be slow for large volumes, but allows mixing privileged and unprivileged Pods sharing the same volume on the same node. "MountOption" mounts all eligible Pod volumes with `-o context` mount option. This requires all Pods that share the same volume to use the same SELinux label. It is not possible to share the same volume among privileged and unprivileged Pods. Eligible volumes are in-tree FibreChannel and iSCSI volumes, and all CSI volumes whose CSI driver announces SELinux support by setting spec.seLinuxMount: true in their CSIDriver instance. Other volumes are always re-labelled recursively. "MountOption" value is allowed only when SELinuxMount feature gate is enabled. If not specified and SELinuxMount feature gate is enabled, "MountOption" is used. If not specified and SELinuxMount feature gate is disabled, "MountOption" is used for ReadWriteOncePod volumes and "Recursive" for all other volumes. This field affects only Pods that have SELinux label set, either in PodSecurityContext or in SecurityContext of all containers. All Pods that use the same volume should use the same seLinuxChangePolicy, otherwise some pods can get stuck in ContainerCreating state. Note that this field cannot be set when spec.os.name is windows.

seLinuxOptions object

The SELinux context to be applied to all containers. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in SecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence for that container. Note that this field cannot be set when spec.os.name is windows.

level string

Level is SELinux level label that applies to the container.

role string

Role is a SELinux role label that applies to the container.

type string

Type is a SELinux type label that applies to the container.

user string

User is a SELinux user label that applies to the container.

seccompProfile object

The seccomp options to use by the containers in this pod. Note that this field cannot be set when spec.os.name is windows.

localhostProfile string

localhostProfile indicates a profile defined in a file on the node should be used. The profile must be preconfigured on the node to work. Must be a descending path, relative to the kubelet's configured seccomp profile location. Must be set if type is "Localhost". Must NOT be set for any other type.

type string required

type indicates which kind of seccomp profile will be applied. Valid options are: Localhost - a profile defined in a file on the node should be used. RuntimeDefault - the container runtime default profile should be used. Unconfined - no profile should be applied.

supplementalGroups []integer

A list of groups applied to the first process run in each container, in addition to the container's primary GID and fsGroup (if specified). If the SupplementalGroupsPolicy feature is enabled, the supplementalGroupsPolicy field determines whether these are in addition to or instead of any group memberships defined in the container image. If unspecified, no additional groups are added, though group memberships defined in the container image may still be used, depending on the supplementalGroupsPolicy field. Note that this field cannot be set when spec.os.name is windows.

supplementalGroupsPolicy string

Defines how supplemental groups of the first container processes are calculated. Valid values are "Merge" and "Strict". If not specified, "Merge" is used. (Alpha) Using the field requires the SupplementalGroupsPolicy feature gate to be enabled and the container runtime must implement support for this feature. Note that this field cannot be set when spec.os.name is windows.

sysctls []object

Sysctls hold a list of namespaced sysctls used for the pod. Pods with unsupported sysctls (by the container runtime) might fail to launch. Note that this field cannot be set when spec.os.name is windows.

name string required

Name of a property to set

value string required

Value of a property to set

windowsOptions object

The Windows specific settings applied to all containers. If unspecified, the options within a container's SecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.

gmsaCredentialSpec string

GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.

gmsaCredentialSpecName string

GMSACredentialSpecName is the name of the GMSA credential spec to use.

hostProcess boolean

HostProcess determines if a container should be run as a 'Host Process' container. All of a Pod's containers must have the same effective HostProcess value (it is not allowed to have a mix of HostProcess containers and non-HostProcess containers). In addition, if HostProcess is true then HostNetwork must also be set to true.

runAsUserName string

The UserName in Windows to run the entrypoint of the container process. Defaults to the user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

priority string

Priority determines scheduling priority for GPU allocation. Higher priority services can preempt lower priority ones when GPUs are scarce.

enum: critical, high, normal, low, batch

priorityClassName string

PriorityClassName allows specifying a custom Kubernetes PriorityClass. Takes precedence over the Priority field if set.

probeOverrides object

ProbeOverrides allows replacing the auto-generated health probes. Useful for runtimes with non-HTTP health endpoints (e.g., TCP, WebSocket).

liveness object

Liveness overrides the liveness probe.

exec object

Exec specifies a command to execute in the container.

command []string

Command is the command line to execute inside the container, the working directory for the command is root ('/') in the container's filesystem. The command is simply exec'd, it is not run inside a shell, so traditional shell instructions ('|', etc) won't work. To use a shell, you need to explicitly call out to that shell. Exit status of 0 is treated as live/healthy and non-zero is unhealthy.

failureThreshold integer

Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.

format: int32

grpc object

GRPC specifies a GRPC HealthCheckRequest.

port integer required

Port number of the gRPC service. Number must be in the range 1 to 65535.

format: int32

service string

Service is the name of the service to place in the gRPC HealthCheckRequest (see https://github.com/grpc/grpc/blob/master/doc/health-checking.md). If this is not specified, the default behavior is defined by gRPC.

httpGet object

HTTPGet specifies an HTTP GET request to perform.

host string

Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.

httpHeaders []object

Custom headers to set in the request. HTTP allows repeated headers.

name string required

The header field name. This will be canonicalized upon output, so case-variant names will be understood as the same header.

value string required

The header field value.

path string

Path to access on the HTTP server.

port string | integer required

Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

scheme string

Scheme to use for connecting to the host. Defaults to HTTP.

initialDelaySeconds integer

Number of seconds after the container has started before liveness probes are initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

format: int32

periodSeconds integer

How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.

format: int32

successThreshold integer

Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.

format: int32

tcpSocket object

TCPSocket specifies a connection to a TCP port.

host string

Optional: Host name to connect to, defaults to the pod IP.

port string | integer required

Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

terminationGracePeriodSeconds integer

Optional duration in seconds the pod needs to terminate gracefully upon probe failure. The grace period is the duration in seconds after the processes running in the pod are sent a termination signal and the time when the processes are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. If this value is nil, the pod's terminationGracePeriodSeconds will be used. Otherwise, this value overrides the value provided by the pod spec. Value must be non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down). This is a beta field and requires enabling ProbeTerminationGracePeriod feature gate. Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.

format: int64

timeoutSeconds integer

Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

format: int32

readiness object

Readiness overrides the readiness probe.

exec object

Exec specifies a command to execute in the container.

command []string

failureThreshold integer

Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.

format: int32

grpc object

GRPC specifies a GRPC HealthCheckRequest.

port integer required

Port number of the gRPC service. Number must be in the range 1 to 65535.

format: int32

service string

httpGet object

HTTPGet specifies an HTTP GET request to perform.

host string

Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.

httpHeaders []object

Custom headers to set in the request. HTTP allows repeated headers.

name string required

The header field name. This will be canonicalized upon output, so case-variant names will be understood as the same header.

value string required

The header field value.

path string

Path to access on the HTTP server.

port string | integer required

Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

scheme string

Scheme to use for connecting to the host. Defaults to HTTP.

initialDelaySeconds integer

Number of seconds after the container has started before readiness probes are initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

format: int32

periodSeconds integer

How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.

format: int32

successThreshold integer

Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.

format: int32

tcpSocket object

TCPSocket specifies a connection to a TCP port.

host string

Optional: Host name to connect to, defaults to the pod IP.

port string | integer required

Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

terminationGracePeriodSeconds integer

format: int64

timeoutSeconds integer

Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

format: int32

startup object

Startup overrides the startup probe.

exec object

Exec specifies a command to execute in the container.

command []string

failureThreshold integer

Minimum consecutive failures for the probe to be considered failed after having succeeded. Defaults to 3. Minimum value is 1.

format: int32

grpc object

GRPC specifies a GRPC HealthCheckRequest.

port integer required

Port number of the gRPC service. Number must be in the range 1 to 65535.

format: int32

service string

httpGet object

HTTPGet specifies an HTTP GET request to perform.

host string

Host name to connect to, defaults to the pod IP. You probably want to set "Host" in httpHeaders instead.

httpHeaders []object

Custom headers to set in the request. HTTP allows repeated headers.

name string required

The header field name. This will be canonicalized upon output, so case-variant names will be understood as the same header.

value string required

The header field value.

path string

Path to access on the HTTP server.

port string | integer required

Name or number of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

scheme string

Scheme to use for connecting to the host. Defaults to HTTP.

initialDelaySeconds integer

Number of seconds after the container has started before startup probes are initiated. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

format: int32

periodSeconds integer

How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.

format: int32

successThreshold integer

Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.

format: int32

tcpSocket object

TCPSocket specifies a connection to a TCP port.

host string

Optional: Host name to connect to, defaults to the pod IP.

port string | integer required

Number or name of the port to access on the container. Number must be in the range 1 to 65535. Name must be an IANA_SVC_NAME.

terminationGracePeriodSeconds integer

format: int64

timeoutSeconds integer

Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1. More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

format: int32
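A minimal sketch of a probeOverrides block for a runtime whose health check is not the auto-generated HTTP probe. Port 8080 and the `/healthz` path are assumptions for illustration, not values defined by this API; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  probeOverrides:
    startup:
      tcpSocket:
        port: 8080            # assumed container port
      periodSeconds: 10
      failureThreshold: 30    # tolerates roughly 300 s of model loading
    liveness:
      tcpSocket:
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 1
      failureThreshold: 3
    readiness:
      httpGet:
        path: /healthz        # hypothetical health path
        port: 8080
      periodSeconds: 5
      successThreshold: 1
```

A generous startup probe keeps slow model loads from tripping the liveness probe, since liveness and readiness checks only begin once the startup probe has succeeded.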

reasoningBudget integer

ReasoningBudget caps the number of reasoning tokens the model is allowed to emit per response. Zero disables visible thinking output entirely; the model still reasons internally but does not emit thinking tokens. Critical for production agentic workloads on thinking models (Qwen 3.6, GLM-5) where runaway reasoning can burn compute. Maps to llama.cpp --reasoning-budget flag.

format: int32

minimum: 0

reasoningBudgetMessage string

ReasoningBudgetMessage is injected when the reasoning budget is exhausted, forcing the model to conclude. Ignored unless ReasoningBudget is also set. Maps to llama.cpp --reasoning-budget-message flag.
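For instance, a thinking model could be capped as in the fragment below. The budget of 2048 tokens and the message text are illustrative values, not recommendations from this API; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  # Upper bound on reasoning tokens per response (0 disables visible thinking).
  reasoningBudget: 2048
  # Injected once the budget is exhausted; ignored unless reasoningBudget is set.
  reasoningBudgetMessage: "Reasoning budget reached. Give the final answer now."
```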

replicas integer

Replicas is the desired number of inference pods

format: int32

minimum: 0

maximum: 10

resources object

Resources defines compute resources for inference pods

cpu string

CPU requests (e.g., "2" or "2000m")

gpu integer

GPU count required per pod. For multi-GPU inference, each pod gets this many GPUs. Note: Multi-GPU sharding config comes from the Model CRD.

format: int32

minimum: 0

maximum: 8

gpuMemory string

GPUMemory specifies the GPU memory limit per pod (e.g., "16Gi"). Used for scheduling and validation.

gpuSharing object

GPUSharing declares how this InferenceService consumes its GPU: exclusively (whole device, the default), as a hardware partition (e.g. NVIDIA MIG), or co-resident with other workloads on a shared device. Sharing is a serving-time decision, which is why it lives here rather than on the Model: the same Model can run exclusive in production and shared in dev. Unset means exclusive, preserving the behavior of every existing manifest.

memoryLimitGiB integer

MemoryLimitGiB caps this service's device-memory footprint in shared mode. It drives quota accounting (a shared workload counts this many GiB against a GPUQuota vramBytes cap) and, where the runtime supports it, a memory-cap enforcement flag. Only valid for mode shared.

format: int32

minimum: 1

mode string

Mode selects the sharing tier. Defaults to exclusive.

enum: exclusive, shared, partitioned

profile string

Profile names the hardware partition to request. Required when mode is partitioned, forbidden otherwise. The string is vendor-specific; for NVIDIA MIG it is the profile name as exposed by the device plugin, e.g. "1g.24gb" or "3g.90gb", and resolves to the extended resource nvidia.com/mig-<profile>.

hostMemory string

HostMemory specifies the system RAM required for hybrid GPU/CPU offloading (e.g., "64Gi"). Used when MoE expert weights or KV cache are offloaded to CPU via moeCPUOffload or noKvOffload. Translated to pod resources.requests.memory, taking precedence over Memory when set. Without this, the K8s scheduler has no visibility into the pod's actual RAM consumption, which can lead to OOM kills after model load.

memory string

Memory requests (e.g., "4Gi")
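The two fragments below sketch how the resources fields might be combined: one for a shared GPU with a memory cap, one for a MIG partition. All quantities are illustrative, and whether `gpu: 1` is required alongside gpuSharing is an assumption rather than something this reference states.

```yaml
# Fragment 1: shared GPU with a device-memory cap; other spec fields omitted.
spec:
  resources:
    cpu: "4"
    memory: "16Gi"
    hostMemory: "64Gi"        # takes precedence over memory for requests.memory
    gpu: 1                    # assumption: one device requested
    gpuSharing:
      mode: shared
      memoryLimitGiB: 12      # only valid for mode shared
---
# Fragment 2: hardware partition (NVIDIA MIG); other spec fields omitted.
spec:
  resources:
    cpu: "2"
    memory: "8Gi"
    gpuSharing:
      mode: partitioned
      profile: "1g.24gb"      # required when mode is partitioned
```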

revisionHistoryLimit integer

RevisionHistoryLimit caps how many old ReplicaSets the inference Deployment keeps for rollback. Unset uses the Kubernetes default (10); 0 keeps none.

format: int32

minimum: 0

rolloutPolicy object

RolloutPolicy controls how deployment updates are applied. When waitForIdle is true, the controller will check backend slot idleness before updating the Deployment pod-template. Idle detection support by runtime:

- llama.cpp: native /slots endpoint (default)
- vLLM: Prometheus metrics scrape (vllm:num_requests_running)
- TGI: Prometheus metrics scrape (tgi_batch_current_size)
- SGLang: Prometheus metrics scrape (sglang:num_running_reqs)
- generic: optional AnnotationIdleEndpoint annotation for custom probe

force boolean

Force bypasses the idle check and proceeds with the rollout immediately. When true, WaitForIdle is ignored. Useful for emergency rollouts or when slots are stuck in a non-idle state.

idleTimeoutSeconds integer

IdleTimeoutSeconds is the maximum time to wait for slots to become idle before proceeding with the rollout regardless of slot state. Defaults to 300 (5 minutes) when omitted or set to 0.

minimum: 0

waitForIdle boolean

WaitForIdle indicates whether to wait for all backend slots to report idle before applying a Deployment pod-template update. When true, the controller probes each replica and defers the rollout until all replicas are idle or the idleTimeoutSeconds expires. Idle detection is runtime-specific: llama.cpp uses /slots, vLLM/TGI/SGLang scrape Prometheus gauges, and generic runtimes may set AnnotationIdleEndpoint for a custom HTTP probe. Runtimes without idle detection support proceed immediately with ReasonIdleCheckUnsupported.
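A sketch of a rollout policy that defers updates until replicas are idle, with a longer-than-default timeout. The value 600 is illustrative; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  rolloutPolicy:
    waitForIdle: true
    # Proceed anyway after 10 minutes; omitting this (or 0) means 300 s.
    idleTimeoutSeconds: 600
    # Emergency switch: when true, waitForIdle is ignored.
    # force: true
```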

ropeScaling object

RopeScaling configures RoPE-based context extension so a model can be served past its native trained context (e.g. 128K served at 256K via YaRN). For the llamacpp runtime this maps to --rope-scaling / --rope-scale / --yarn-orig-ctx. Prefer this over raw spec.extraArgs: it is validated and discoverable via `kubectl explain`. If --rope-scaling is also present in spec.extraArgs, extraArgs wins and this is skipped.

factor string

Factor is the scale multiplier (--rope-scale), e.g. "2.0" to double the native context. A string to avoid CRD float pitfalls; the runtime parses it as a float. Optional.

pattern: ^[0-9]+(\.[0-9]+)?$

originalContext integer

OriginalContext is the model's native training context length (--yarn-orig-ctx), e.g. 131072 for a 128K model. Recommended with yarn.

format: int32

minimum: 128

type string required

Type is the scaling method (--rope-scaling). "yarn" is the usual choice for extending context (e.g. 128K to 256K).

enum: linear, yarn, longrope
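Using the example from the description (a 128K model served at 256K via YaRN), a ropeScaling block might look like the fragment below. Other spec fields, including whatever sets the served context size, are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  ropeScaling:
    type: yarn               # --rope-scaling
    factor: "2.0"            # --rope-scale; a string, parsed as a float
    originalContext: 131072  # --yarn-orig-ctx; native 128K context
```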

runtime string

Runtime selects the inference server backend. "llamacpp" (default): llama.cpp server with auto-generated args and /health probes. "llamacpp-router": llama.cpp server in router mode for multi-model dynamic loading. "generic": user-provided container with custom command, args, env, and probes. "personaplex": NVIDIA PersonaPlex (Moshi) speech-to-speech server. "vllm": vLLM OpenAI-compatible server with PagedAttention. "tgi": HuggingFace Text Generation Inference server. "sglang": SGLang OpenAI-compatible server with RadixAttention prefix caching.

enum: llamacpp, llamacpp-router, personaplex, vllm, tgi, sglang, generic

runtimeClassName string

RuntimeClassName selects a Kubernetes RuntimeClass for the inference Pod. Most commonly set to "nvidia" on clusters where the NVIDIA Container Runtime is not configured as the cluster default. Without it, GPU pods schedule onto the GPU node but never get the device files bind-mounted, and the container fails at runtime with "no CUDA-capable device is detected". Maps directly to PodSpec.RuntimeClassName. Most clusters running the NVIDIA GPU Operator with the default toolkit env do not need this set; it is an escape hatch for clusters where the runtime configuration is non-default.
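As a short sketch, selecting a backend and an explicit RuntimeClass could look like this. It assumes a RuntimeClass named "nvidia" already exists in the cluster; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  runtime: vllm              # defaults to llamacpp when unset
  runtimeClassName: nvidia   # assumes this RuntimeClass exists in the cluster
```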

securityContext object

SecurityContext defines container-level security attributes for the inference container.

allowPrivilegeEscalation boolean

AllowPrivilegeEscalation controls whether a process can gain more privileges than its parent process. This bool directly controls if the no_new_privs flag will be set on the container process. AllowPrivilegeEscalation is always true when the container: 1) is run as Privileged, or 2) has CAP_SYS_ADMIN. Note that this field cannot be set when spec.os.name is windows.

appArmorProfile object

appArmorProfile is the AppArmor options to use by this container. If set, this profile overrides the pod's appArmorProfile. Note that this field cannot be set when spec.os.name is windows.

localhostProfile string

type string required

capabilities object

The capabilities to add/drop when running containers. Defaults to the default set of capabilities granted by the container runtime. Note that this field cannot be set when spec.os.name is windows.

add []string

Added capabilities

drop []string

Removed capabilities

privileged boolean

Run container in privileged mode. Processes in privileged containers are essentially equivalent to root on the host. Defaults to false. Note that this field cannot be set when spec.os.name is windows.

procMount string

procMount denotes the type of proc mount to use for the containers. The default value is Default which uses the container runtime defaults for readonly paths and masked paths. Note that this field cannot be set when spec.os.name is windows.

readOnlyRootFilesystem boolean

Whether this container has a read-only root filesystem. Default is false. Note that this field cannot be set when spec.os.name is windows.

runAsGroup integer

The GID to run the entrypoint of the container process. Uses runtime default if unset. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

format: int64

runAsNonRoot boolean

Indicates that the container must run as a non-root user. If true, the Kubelet will validate the image at runtime to ensure that it does not run as UID 0 (root) and fail to start the container if it does. If unset or false, no such validation will be performed. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence.

runAsUser integer

The UID to run the entrypoint of the container process. Defaults to user specified in image metadata if unspecified. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

format: int64

seLinuxOptions object

The SELinux context to be applied to the container. If unspecified, the container runtime will allocate a random SELinux context for each container. May also be set in PodSecurityContext. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is windows.

level string

Level is SELinux level label that applies to the container.

role string

Role is a SELinux role label that applies to the container.

type string

Type is a SELinux type label that applies to the container.

user string

User is a SELinux user label that applies to the container.

seccompProfile object

The seccomp options to use by this container. If seccomp options are provided at both the pod & container level, the container options override the pod options. Note that this field cannot be set when spec.os.name is windows.

localhostProfile string

type string required

windowsOptions object

The Windows specific settings applied to all containers. If unspecified, the options from the PodSecurityContext will be used. If set in both SecurityContext and PodSecurityContext, the value specified in SecurityContext takes precedence. Note that this field cannot be set when spec.os.name is linux.

gmsaCredentialSpec string

GMSACredentialSpec is where the GMSA admission webhook (https://github.com/kubernetes-sigs/windows-gmsa) inlines the contents of the GMSA credential spec named by the GMSACredentialSpecName field.

gmsaCredentialSpecName string

GMSACredentialSpecName is the name of the GMSA credential spec to use.

hostProcess boolean

runAsUserName string
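A restrictive Linux securityContext for the inference container might be sketched as follows. The UID is an assumption, and whether a given runtime image actually runs as non-root with all capabilities dropped has to be checked per image; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000                  # assumed non-root UID
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: RuntimeDefault
```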

sglangConfig object

SGLangConfig holds configuration for the SGLang runtime. Only used when Runtime is "sglang".

attentionBackend string

AttentionBackend selects the attention implementation. flashinfer is fastest on recent NVIDIA GPUs; flash_attn is portable; torch_native is the fallback. Maps to SGLang --attention-backend flag.

enum: flashinfer, flash_attn, torch_native

chatTemplate string

ChatTemplate overrides the model's bundled chat template. Maps to SGLang --chat-template flag.

chunkedPrefillSize integer

ChunkedPrefillSize sets the chunk size for chunked prefill (tokens). Maps to SGLang --chunked-prefill-size flag.

format: int32

minimum: 512

contextLength integer

ContextLength sets the maximum model context length. Maps to SGLang --context-length flag.

format: int32

minimum: 128

dataParallelSize integer

DataParallelSize sets the number of data-parallel replicas (SGLang-side controller). Maps to SGLang --dp flag. Not auto-derived; set explicitly. NOTE: at present this only sets the in-process SGLang `--dp` flag. Multi-replica rendezvous (SGLang's --dist-init-addr + a stable network identity per pod, e.g. headless service + StatefulSet) is not yet wired into the InferenceService controller and is tracked at https://github.com/defilantech/LLMKube/issues/1102. Setting this on an InferenceService with replicas > 1 will leave each replica starting as its own DP-1 group; operators wanting true DP coordination should hold off on this flag until #1102 lands.

format: int32

minimum: 1

enablePrefixCaching boolean

EnablePrefixCaching turns on RadixAttention automatic prefix caching. Headline feature for agentic workloads with shared system-prompt + tool-definition + repo-context prefixes. Maps to SGLang --enable-prefix-caching.

expertParallelSize integer

ExpertParallelSize sets the number of GPUs for expert parallelism (MoE models). Maps to SGLang --ep flag. Not auto-derived; set explicitly.

format: int32

minimum: 1

hfTokenSecretRef object

HFTokenSecretRef references a Secret containing the HuggingFace token. Injected as HF_TOKEN env var.

key string required

The key of the secret to select from. Must be a valid secret key.

name string

optional boolean

Specify whether the Secret or its key must be defined

kvCacheCustomDtype string

KVCacheCustomDtype sets a custom SGLang KV cache type not in the standard enum. Maps to SGLang --kv-cache-dtype flag. Takes precedence over KVCacheDtype when both are set. LLMKube does not validate the string.

kvCacheDtype string

KVCacheDtype selects the KV cache element type. auto follows dtype. fp8_e5m2 / fp8_e4m3 cut KV memory roughly in half. Custom values not in the enum (e.g., TurboQuant) go in KVCacheCustomDtype. Maps to SGLang --kv-cache-dtype flag.

enum: auto, fp8_e5m2, fp8_e4m3

logLevel string

LogLevel sets the SGLang server log level. SGLang accepts "debug"/"info"/"warning"/"error". Maps to SGLang --log-level flag.

enum: debug, info, warning, error

loraAdapters []object

LoraAdapters is a typed replacement for LoraModules. Each adapter has a stable Name (SGLang-side handle) and Path (file mount). When both LoraAdapters and LoraModules are set, LoraAdapters wins on name collision. Maps to SGLang --lora-paths flag (singular `lora_paths`, not vLLM's --lora-modules — see https://github.com/sgl-project/sglang/blob/v0.5.15/python/sglang/srt/server_args.py).

name string required

Name is the SGLang-side adapter handle used in inference requests.

minLength: 1

path string required

Path is the path on disk inside the SGLang container where the adapter weights are mounted.

minLength: 1

loraModules []string

LoraModules is the legacy form of --lora-paths entries. Each element is either `name=path` shorthand or a JSON object {"name":"x","path":"/p"}. New callers should prefer the typed LoraAdapters field; the controller merges both, with LoraAdapters winning on name collision. Deprecated: use LoraAdapters instead.

loraTargetModules []string

LoraTargetModules lists the modules LoRA adapters may target (e.g., "q_proj", "k_proj"). Maps to SGLang --lora-target-modules flag.

maxLoraRank integer

MaxLoraRank sets the maximum LoRA rank accepted at load time. Maps to SGLang --max-lora-rank flag.

format: int32

minimum: 1

maxRunningRequests integer

MaxRunningRequests caps concurrent in-flight requests. Maps to SGLang --max-running-requests flag. Spec.parallelSlots on the llama.cpp runtime is the analog; SGLang uses its own name.

format: int32

minimum: 1

memFractionStatic number

MemFractionStatic sets the fraction of GPU memory used for static state (model weights + KV cache). Range 0.1-0.99. Requires GPU. Maps to SGLang --mem-fraction-static flag.

minimum: 0.1

maximum: 0.99

quantization string

Quantization sets the quantization method. SGLang accepts fp8/awq/gptq/modelopt. Maps to SGLang --quantization flag.

reasoningParser string

ReasoningParser selects the reasoning-content extraction format. For thinking models (qwen3, deepseek-r1). Maps to SGLang --reasoning-parser.

enum: qwen3, deepseek-r1

skipTokenizerInit boolean

SkipTokenizerInit skips tokenizer initialization at startup. Useful for prefill-only disaggregation deployments. Maps to SGLang --skip-tokenizer-init flag. Omit to leave SGLang's default.

speculative object

Speculative configures speculative decoding (EAGLE / EAGLE3 / Medusa).

acceptThresholdAcc number

AcceptThresholdAcc sets the acceptance threshold for the bonus token in accepted-token-sequence verification (an accepted draft token's probability must exceed p * accept_threshold_acc). Valid only when Enabled is true; surface a status condition when set otherwise. Maps to SGLang --speculative-accept-threshold-acc flag.

minimum: 0

maximum: 1

acceptThresholdSingle number

AcceptThresholdSingle sets the acceptance threshold for non-matched tokens in single-sequence decoding (a draft token is accepted when its probability exceeds p * accept_threshold_single). Valid only when Enabled is true; surface a status condition when set otherwise. Maps to SGLang --speculative-accept-threshold-single flag.

minimum: 0

maximum: 1

algorithm string

Algorithm selects the speculative algorithm (EAGLE, EAGLE3, Medusa). Maps to SGLang --speculative-algorithm flag.

enum: EAGLE, EAGLE3, Medusa

draftModelPath string

DraftModelPath is the path to the draft model weights (for EAGLE). Maps to SGLang --speculative-draft-model-path flag. Required when Enabled.

eagleTopK integer

EagleTopK is the top-k sampling for EAGLE draft tokens. Maps to SGLang --speculative-eagle-topk flag.

format: int32

minimum: 1

enabled boolean

Enabled toggles speculative decoding on. When false or nil, no flags are emitted regardless of other fields.

numDraftTokens integer

NumDraftTokens is the number of draft tokens proposed per step. Maps to SGLang --speculative-num-draft-tokens flag.

format: int32

minimum: 1

numSteps integer

NumSteps is the number of draft steps per forward pass. Maps to SGLang --speculative-num-steps flag.

format: int32

minimum: 1

tensorParallelSize integer

TensorParallelSize sets the number of GPUs for tensor parallelism. Maps to SGLang --tp flag.

format: int32

minimum: 1

toolCallParser string

ToolCallParser selects the tool-call extraction format. For foreman tool-loop workloads. Maps to SGLang --tool-call-parser flag.

enum: llama3, qwen3, qwen25, hermes, functionary, mistral

trustRemoteCode boolean

TrustRemoteCode allows loading remote code from the HuggingFace Hub model repo. Mirrors the flag on other runtimes. Maps to SGLang --trust-remote-code flag. Omit to leave SGLang's default.
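Pulling several of the sglangConfig fields together, a configuration for an agentic workload might look like the sketch below. The Secret name, adapter name, file paths, and all numeric values are hypothetical, and it is assumed that the pod's GPU count is sized elsewhere to match tensorParallelSize; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  runtime: sglang
  sglangConfig:
    tensorParallelSize: 2          # --tp
    contextLength: 32768           # --context-length
    memFractionStatic: 0.85        # --mem-fraction-static, range 0.1-0.99
    kvCacheDtype: fp8_e5m2         # --kv-cache-dtype
    attentionBackend: flashinfer   # --attention-backend
    enablePrefixCaching: true
    maxRunningRequests: 64         # --max-running-requests
    toolCallParser: qwen3          # --tool-call-parser
    reasoningParser: qwen3         # --reasoning-parser
    hfTokenSecretRef:
      name: hf-token               # hypothetical Secret name
      key: token                   # hypothetical key within the Secret
    loraAdapters:
      - name: support-bot              # hypothetical adapter handle
        path: /adapters/support-bot    # hypothetical mount path
    speculative:
      enabled: true                # no flags are emitted unless true
      algorithm: EAGLE
      draftModelPath: /models/draft    # hypothetical path; required when enabled
      numSteps: 3
      eagleTopK: 4
      numDraftTokens: 8
```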

skipModelInit boolean

SkipModelInit disables the model-downloader init container. Use when the model is baked into the image or downloaded by the container itself (e.g., via HF_TOKEN).

slo object

SLO declares a service-level objective for this inference service. When set (and the operator runs with --enable-pyrra-slo), the controller creates a Pyrra ServiceLevelObjective in the same namespace; Pyrra generates the recording and alert rules. Requires Pyrra installed in the cluster (https://github.com/pyrra-dev/pyrra).

indicator string

Indicator selects the measured signal. "availability" is scrape success of the serving pod (Prometheus `up`); "latency" is the fraction of requests completing under latencyThreshold. Latency is currently supported on the vllm runtime only (llama.cpp exports no request-latency histogram).

enum: availability, latency

latencyThreshold string

LatencyThreshold is the request-duration bound in seconds (e.g. "2" or "0.5") a request must beat to count as good. Required when indicator is latency. Must match a histogram bucket boundary of the runtime's latency metric (see docs/observability/slo.md).

pattern: ^[0-9]+(\.[0-9]+)?$

name string

Name is the SLO identifier shown in Pyrra and Grafana. Defaults to "<inferenceservice-name>-<indicator>".

objective string required

Objective is the target as a percentage string between 50 and 99.999, e.g. "99.5". A string because CRD validation cannot express float64 (controller-tools #245); Pyrra's own target field has the same shape and this value passes through unchanged.

pattern: ^[0-9]+(\.[0-9]+)?$

window string

Window is the rolling window the objective is measured over.

pattern: ^[0-9]+[mhdw]$
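An availability SLO could be declared as in the fragment below. It assumes the operator runs with --enable-pyrra-slo and that Pyrra is installed; the objective and window values are illustrative and only need to match the patterns above. Other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  slo:
    indicator: availability
    objective: "99.5"     # percentage string between 50 and 99.999
    window: 4w            # must match ^[0-9]+[mhdw]$
    # For a latency SLO (vllm runtime only), the threshold is required and
    # must match a histogram bucket boundary of the runtime's latency metric:
    # indicator: latency
    # latencyThreshold: "0.5"
```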

speculativeDecoding object

SpeculativeDecoding configures speculative decoding for the llama.cpp runtime using MTP (Multi-Token Prediction) or draft-model decoding. Maps to llama.cpp --spec-type and --draft-n-max flags. Only the "llamacpp" runtime supports this field; other runtimes must not set it.

nDraftMax integer

NDraftMax is the maximum number of draft tokens to propose per step (--draft-n-max). Only emitted when set; llama.cpp uses its own default otherwise.

format: int32

minimum: 1

type string required

Type is the speculative decoding method (--spec-type). "mtp" maps to draft-mtp, "draft" maps to draft, and "disabled" (or omitting the entire SpeculativeDecoding block) means no speculative decoding.

enum: mtp, draft, disabled
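A minimal sketch of enabling MTP speculative decoding on the llama.cpp runtime. The draft-token cap of 4 is an illustrative value; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  runtime: llamacpp
  speculativeDecoding:
    type: mtp        # maps to --spec-type draft-mtp
    nDraftMax: 4     # --draft-n-max; omit to use llama.cpp's own default
```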

tensorOverrides []string

TensorOverrides provides fine-grained tensor placement overrides for power users. Each entry specifies a tensor name and target device (e.g., "exps=CPU", "token_embd=CUDA0"). Maps to llama.cpp --override-tensor flag (one flag per entry).
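Using the two entries quoted in the description, a tensorOverrides list would be written as below; each entry becomes its own --override-tensor flag. Other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  tensorOverrides:
    - "exps=CPU"           # keep expert tensors on the CPU
    - "token_embd=CUDA0"   # place the token embedding on the first CUDA device
```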

tgiConfig object

TGIConfig holds configuration for the TGI runtime. Only used when Runtime is "tgi".

dtype string

Dtype sets the model data type (float16, bfloat16).

enum: float16, bfloat16

hfTokenSecretRef object

HFTokenSecretRef references a Secret containing the HuggingFace token.

key string required

The key of the secret to select from. Must be a valid secret key.

name string

optional boolean

Specify whether the Secret or its key must be defined

maxInputLength integer

MaxInputLength sets the maximum input token length.

format: int32

maxTotalTokens integer

MaxTotalTokens sets the maximum total tokens (input + output).

format: int32

quantize string

Quantize sets the quantization method (bitsandbytes, gptq, awq, eetq).

enum: bitsandbytes, gptq, awq, eetq
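A TGI configuration might be sketched as follows. The Secret name and key and the token limits are hypothetical; other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  runtime: tgi
  tgiConfig:
    dtype: bfloat16
    maxInputLength: 4096
    maxTotalTokens: 8192     # input + output
    hfTokenSecretRef:
      name: hf-token         # hypothetical Secret name
      key: token             # hypothetical key within the Secret
    # Alternative to dtype for quantized serving (one of the enum values):
    # quantize: awq
```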

tolerations []object

Tolerations for pod scheduling (e.g., GPU taints, spot instances)

effect string

Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.

key string

Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty, operator must be Exists; this combination means to match all values and all keys.

operator string

Operator represents a key's relationship to the value. Valid operators are Exists, Equal, Lt, and Gt. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category. Lt and Gt perform numeric comparisons (requires feature gate TaintTolerationComparisonOperators).

tolerationSeconds integer

TolerationSeconds represents the period of time the toleration (which must be of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.

format: int64

value string

Value is the taint value the toleration matches to. If the operator is Exists, the value should be empty, otherwise just a regular string.
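The fragment below sketches tolerations for a tainted GPU node pool and for spot nodes. The taint keys are assumptions: `nvidia.com/gpu` is a commonly used GPU taint key and `example.com/spot` is hypothetical, so substitute whatever taints the cluster actually applies. Other spec fields are omitted.

```yaml
# Fragment of an InferenceService manifest; other spec fields omitted.
spec:
  tolerations:
    # Tolerate any value of the GPU taint.
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    # Tolerate a hypothetical spot taint, but only for 5 minutes
    # after it starts evicting.
    - key: example.com/spot
      operator: Equal
      value: "true"
      effect: NoExecute
      tolerationSeconds: 300
```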

topologySpreadConstraints []object

TopologySpreadConstraints control how inference Pods are spread across topology domains (e.g. one model server per GPU node). Passthrough to the Pod spec; combine with PodLabels so the constraint's labelSelector can match sibling GPU workloads for a soft, cross-app spread.

labelSelector object

LabelSelector is used to find matching pods. Pods that match this label selector are counted to determine the number of pods in their corresponding topology domain.

matchExpressions []object

matchExpressions is a list of label selector requirements. The requirements are ANDed.

key string required

key is the label key that the selector applies to.

operator string required

operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.

values []string

matchLabels object

matchLabelKeys []string

MatchLabelKeys is a set of pod label keys to select the pods over which spreading will be calculated. The keys are used to lookup values from the incoming pod labels, those key-value labels are ANDed with labelSelector to select the group of existing pods over which spreading will be calculated for the incoming pod. The same key is forbidden to exist in both MatchLabelKeys and LabelSelector. MatchLabelKeys cannot be set when LabelSelector isn't set. Keys that don't exist in the incoming pod labels will be ignored. A null or empty list means only match against labelSelector. This is a beta field and requires the MatchLabelKeysInPodTopologySpread feature gate to be enabled (enabled by default).

maxSkew integer required

MaxSkew describes the degree to which pods may be unevenly distributed. When `whenUnsatisfiable=DoNotSchedule`, it is the maximum permitted difference between the number of matching pods in the target topology and the global minimum. The global minimum is the minimum number of matching pods in an eligible domain or zero if the number of eligible domains is less than MinDomains.

For example, in a 3-zone cluster, MaxSkew is set to 1, and pods with the same labelSelector spread as 2/2/1. In this case, the global minimum is 1.

| zone1 | zone2 | zone3 |
|  P P  |  P P  |   P   |

- If MaxSkew is 1, incoming pod can only be scheduled to zone3 to become 2/2/2; scheduling it onto zone1 (zone2) would make the ActualSkew (3-1) on zone1 (zone2) violate MaxSkew (1).
- If MaxSkew is 2, incoming pod can be scheduled onto any zone.

When `whenUnsatisfiable=ScheduleAnyway`, it is used to give higher precedence to topologies that satisfy it. It's a required field. Default value is 1 and 0 is not allowed.

format: int32

minDomains integer

MinDomains indicates a minimum number of eligible domains. When the number of eligible domains with matching topology keys is less than minDomains, Pod Topology Spread treats "global minimum" as 0, and then the calculation of Skew is performed. When the number of eligible domains with matching topology keys is equal to or greater than minDomains, this value has no effect on scheduling. As a result, when the number of eligible domains is less than minDomains, the scheduler won't schedule more than maxSkew Pods to those domains. If value is nil, the constraint behaves as if MinDomains is equal to 1. Valid values are integers greater than 0. When value is not nil, WhenUnsatisfiable must be DoNotSchedule.

For example, in a 3-zone cluster, MaxSkew is set to 2, MinDomains is set to 5 and pods with the same labelSelector spread as 2/2/2:

| zone1 | zone2 | zone3 |
|  P P  |  P P  |  P P  |

The number of domains is less than 5 (MinDomains), so "global minimum" is treated as 0. In this situation, a new pod with the same labelSelector cannot be scheduled, because the computed skew will be 3 (3 - 0) if the new Pod is scheduled to any of the three zones, which violates MaxSkew.

format: int32

nodeAffinityPolicy string

NodeAffinityPolicy indicates how we will treat Pod's nodeAffinity/nodeSelector when calculating pod topology spread skew. Options are:

- Honor: only nodes matching nodeAffinity/nodeSelector are included in the calculations.
- Ignore: nodeAffinity/nodeSelector are ignored. All nodes are included in the calculations.

If this value is nil, the behavior is equivalent to the Honor policy.

nodeTaintsPolicy string

NodeTaintsPolicy indicates how we will treat node taints when calculating pod topology spread skew. Options are:

- Honor: nodes without taints, along with tainted nodes for which the incoming pod has a toleration, are included.
- Ignore: node taints are ignored. All nodes are included.

If this value is nil, the behavior is equivalent to the Ignore policy.

topologyKey string required

TopologyKey is the key of node labels. Nodes that have a label with this key and identical values are considered to be in the same topology. We consider each <key, value> as a "bucket", and try to put balanced number of pods into each bucket. We define a domain as a particular instance of a topology. Also, we define an eligible domain as a domain whose nodes meet the requirements of nodeAffinityPolicy and nodeTaintsPolicy. e.g. If TopologyKey is "kubernetes.io/hostname", each Node is a domain of that topology. And, if TopologyKey is "topology.kubernetes.io/zone", each zone is a domain of that topology. It's a required field.

whenUnsatisfiable string required

WhenUnsatisfiable indicates how to deal with a pod if it doesn't satisfy the spread constraint.

- DoNotSchedule (default) tells the scheduler not to schedule it.
- ScheduleAnyway tells the scheduler to schedule the pod in any location, but giving higher precedence to topologies that would help reduce the skew.

A constraint is considered "Unsatisfiable" for an incoming pod if and only if every possible node assignment for that pod would violate "MaxSkew" on some topology. For example, in a 3-zone cluster, MaxSkew is set to 1, and pods with the same labelSelector spread as 3/1/1:

| zone1 | zone2 | zone3 |
| P P P |   P   |   P   |

If WhenUnsatisfiable is set to DoNotSchedule, incoming pod can only be scheduled to zone2 (zone3) to become 3/2/1 (3/1/2) as ActualSkew (2-1) on zone2 (zone3) satisfies MaxSkew (1). In other words, the cluster can still be imbalanced, but scheduler won't make it *more* imbalanced. It's a required field.
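
To make the interaction of these fields concrete, here is a minimal sketch of a soft, per-node spread. It assumes topologySpreadConstraints sits directly under spec and that the inference Pods (and any sibling GPU workloads) carry the hypothetical label workload-class: gpu, for example via PodLabels. Other spec fields are omitted.

```yaml
# Fragment of an InferenceService; other spec fields are omitted.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      # With this key, each Node is its own topology domain.
      topologyKey: kubernetes.io/hostname
      # Soft constraint: the scheduler prefers nodes that reduce skew but
      # still places the pod if none do. minDomains is left unset because
      # it requires DoNotSchedule.
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          workload-class: gpu   # hypothetical label shared by GPU workloads
```

Switching whenUnsatisfiable to DoNotSchedule turns this into a hard constraint, at the cost of pods staying Pending when every placement would exceed maxSkew.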

turboQuantBits integer

TurboQuantBits sets the KV cache quantization bit width for the oMLX runtime (3, 6, or 8). Maps to oMLX --kv-cache-quant. When set, the oMLX daemon uses TurboQuant to compress the KV cache, reducing memory usage by up to 67% with minimal speed impact (~7% overhead). Only meaningful for the omlx runtime; ignored by llamacpp and other runtimes. Requires oMLX v0.3.4+ (which introduced 3-bit TurboQuant) or a later dev build (6-bit and 8-bit options).

enum: 3, 6, 8

format: int32
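
A minimal sketch of enabling 3-bit KV cache quantization follows. The runtime field name and the value omlx are assumptions inferred from the description above; other spec fields are omitted.

```yaml
# Fragment of an InferenceService; other spec fields are omitted.
spec:
  runtime: omlx       # assumed field name and value for the oMLX runtime
  turboQuantBits: 3   # one of 3, 6, 8; ignored by non-omlx runtimes
```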

uBatchSize integer

UBatchSize sets the micro-batch size for decoding. Smaller micro-batches reduce memory usage during generation. Maps to llama.cpp --ubatch-size flag.

format: int32

minimum: 1

vllmConfig object

VLLMConfig holds configuration for the vLLM runtime. Only used when Runtime is "vllm".

attentionBackend string

AttentionBackend selects the attention implementation used by vLLM. FLASHINFER is typically fastest on recent NVIDIA GPUs (especially Blackwell); FLASH_ATTN is a solid default; XFORMERS and torch_sdpa are portability fallbacks. Requires a vLLM version that supports the chosen backend. Both uppercase (vLLM's native form) and lowercase spellings are accepted for backwards compatibility with earlier LLMKube releases. Maps to vLLM --attention-backend flag.

enum: FLASH_ATTN, FLASHINFER, XFORMERS, flashinfer, flash_attn, xformers, torch_sdpa

cpuOffloadGB integer

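CPUOffloadGB sets how many GB of CPU RAM vLLM may use per GPU to offload model weights, which acts as a virtual increase in GPU memory size. When set, passes --cpu-offload-gb to vLLM. The value is per-rank, so 4 with TP=2 means 4 GB of CPU RAM per GPU. Use when FP8 model weights don't fit in VRAM. Throughput hit is 2-5x on the offloaded path.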

format: int32

minimum: 0

dtype string

Dtype sets the model data type (auto, float16, bfloat16).

enum: auto, float16, bfloat16

enableChunkedPrefill boolean

EnableChunkedPrefill interleaves long prefills with decode steps so a large paste (e.g. a 32K-token file) does not starve concurrent decode streams. Only emitted when explicitly set to true. Maps to vLLM --enable-chunked-prefill flag.

enableExpertParallel boolean

EnableExpertParallel distributes MoE experts across tensor-parallel ranks instead of replicating them. Only meaningful for MoE models. Maps to vLLM --enable-expert-parallel flag.

enablePrefixCaching boolean

EnablePrefixCaching turns on vLLM's automatic prefix caching for repeated prompts. Significantly reduces time-to-first-token for conversational and agentic workloads where requests share a common system prompt. Only emitted when explicitly set to true — when nil or false, vLLM's own default is used (do not emit the flag). Maps to vLLM --enable-prefix-caching flag.

gpuMemoryUtilization number

GPUMemoryUtilization sets the fraction of each GPU's memory that vLLM may use. When set, passes --gpu-memory-utilization to vLLM. Valid range is 0.1 to 0.99. Unset by default, in which case vLLM uses its own default of 0.90.

minimum: 0.1

maximum: 0.99

hfTokenSecretRef object

HFTokenSecretRef references a Secret containing the HuggingFace token.

key string required

The key of the secret to select from. Must be a valid secret key.

name string

optional boolean

Specify whether the Secret or its key must be defined

kvCacheCustomDtype string

KVCacheCustomDtype sets a custom vLLM KV cache element type that is not in the standard enum. Used for vLLM versions with additional cache formats such as TurboQuant 2-bit (turbo2, shipped in v0.20.0). Maps to vLLM --kv-cache-dtype. The runtime image must understand the value or vLLM will fail to start; LLMKube does not validate the string. Mirrors the llama.cpp-side CacheTypeCustomK/V escape hatch. Takes precedence over KVCacheDtype when both are set.

kvCacheDtype string

KVCacheDtype selects the KV cache element type. fp8_e5m2 and fp8_e4m3 cut KV cache memory roughly in half versus auto (which follows dtype), which is what unlocks 128K+ context on consumer VRAM for agentic workloads. Maps to vLLM --kv-cache-dtype flag. For custom build types not in the enum (e.g. TurboQuant turbo2 from vLLM v0.20+), use KVCacheCustomDtype instead.

enum: auto, fp8_e5m2, fp8_e4m3

maxModelLen integer

MaxModelLen sets the maximum model context length.

format: int32

maxNumBatchedTokens integer

MaxNumBatchedTokens sets the maximum number of tokens batched together per step. This is the main throughput knob: too low means prefill-bound, too high risks OOM on long context. No default — only emitted when set. Maps to vLLM --max-num-batched-tokens flag.

format: int32

minimum: 512

quantization string

Quantization method. awq, gptq, squeezellm are classic 4-bit formats. fp8 targets 8-bit FP checkpoints (Qwen FP8, Llama FP8, etc.). nvfp4 is NVIDIA's Blackwell-native 4-bit format. compressed-tensors is the neuralmagic/vLLM cross-format loader used by Unsloth and other recent releases.

enum: awq, gptq, squeezellm, fp8, nvfp4, compressed-tensors

speculative object

Speculative enables draft-model speculative decoding. On single-stream agentic workloads this can be 30-60% faster than plain tensor-parallel execution. Requires a second (smaller) Model CR to act as the draft.

enabled boolean

Enabled toggles speculative decoding on. When false or nil, no speculative flags are emitted regardless of other fields.

model string

Model references the Model CR (in the same namespace as the InferenceService) to use as the speculative draft model. Required when Enabled is true. If missing, speculative decoding is skipped and the InferenceService surfaces a SpeculativeInvalid status condition rather than failing the reconcile. Maps to vLLM --speculative-model flag.

numSpeculativeTokens integer

NumSpeculativeTokens is the number of draft tokens proposed per step. Typical sweet spot is 3-5; higher values increase wasted work when the draft disagrees with the target model. Maps to vLLM --num-speculative-tokens flag.

format: int32

minimum: 1

maximum: 16
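
The speculative fields above might be combined as in the sketch below. The Model CR name draft-model is hypothetical, the runtime field name is an assumption based on the vllmConfig description, and other spec fields are omitted.

```yaml
# Fragment of an InferenceService; other spec fields are omitted.
spec:
  runtime: vllm   # assumed field name; vllmConfig is only used with vllm
  vllmConfig:
    speculative:
      enabled: true
      # Name of a Model CR in the same namespace (hypothetical name).
      model: draft-model
      # Within the documented 1-16 range; 3-5 is the suggested sweet spot.
      numSpeculativeTokens: 4
```

If model were left out while enabled is true, the description above says speculative decoding is skipped and a SpeculativeInvalid condition is surfaced instead of a failed reconcile.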

tensorParallelSize integer

TensorParallelSize sets the number of GPUs for tensor parallelism.

format: int32
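
Putting several vllmConfig fields together, a configuration for a long-context FP8 deployment could look roughly like this. It is a sketch under stated assumptions: the numeric values are illustrative rather than recommendations, the Secret name hf-token and its key are hypothetical, the runtime field name is assumed, and other spec fields are omitted.

```yaml
# Fragment of an InferenceService; other spec fields are omitted.
spec:
  runtime: vllm   # assumed field name; vllmConfig is only used with vllm
  vllmConfig:
    tensorParallelSize: 2
    dtype: bfloat16
    quantization: fp8
    # fp8 KV cache roughly halves KV memory versus auto.
    kvCacheDtype: fp8_e4m3
    maxModelLen: 131072          # illustrative value
    maxNumBatchedTokens: 8192    # illustrative value; schema minimum is 512
    gpuMemoryUtilization: 0.92   # must be within 0.1 to 0.99
    enablePrefixCaching: true
    enableChunkedPrefill: true
    attentionBackend: FLASHINFER
    hfTokenSecretRef:
      name: hf-token   # hypothetical Secret name
      key: token       # hypothetical key inside that Secret
```

Setting kvCacheCustomDtype in addition would take precedence over kvCacheDtype, per the field description above.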

status object

status defines the observed state of InferenceService

conditions []object

conditions represent the current state of the InferenceService resource. Each condition has a unique type and reflects the status of a specific aspect of the resource. Standard condition types include:

- "Available": the resource is fully functional
- "Progressing": the resource is being created or updated
- "Degraded": the resource failed to reach or maintain its desired state

The status of each condition is one of True, False, or Unknown.

lastTransitionTime string required

lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.

format: date-time

message string required

message is a human readable message indicating details about the transition. This may be an empty string.

maxLength: 32768

observedGeneration integer

observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance.

format: int64

minimum: 0

reason string required

reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty.

pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$

minLength: 1

maxLength: 1024

status string required

status of the condition, one of True, False, Unknown.

enum: True, False, Unknown

type string required

type of condition in CamelCase or in foo.example.com/CamelCase.

pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$

maxLength: 316

desiredReplicas integer

DesiredReplicas is the desired number of replicas

format: int32

effectivePriority integer

EffectivePriority shows the resolved priority value from the applied PriorityClass

format: int32

endpoint string

Endpoint is the service URL where inference requests can be sent

gateway object

Gateway reports the result of Envoy AI Gateway exposure for this InferenceService. Populated only when spec.endpoint.gateway is enabled. nil means no gateway exposure was requested (or the gateway integration is disabled because the aigw CRDs are not installed; that case is also surfaced via the GatewayReady condition).

authEnabled boolean

AuthEnabled indicates a SecurityPolicy enforcing JWT authentication was compiled for this route (ModelRouter policy.auth.jwt). Set by the ModelRouter dataPlane: Gateway path; false when no auth is configured.

endpoint string

Endpoint is the gateway address clients send OpenAI requests to. Set by the ModelRouter dataPlane: Gateway path (resolved from the referenced Gateway); empty for the InferenceService path.

modelName string

ModelName is the resolved model-name match value clients send as the OpenAI "model" string to reach this InferenceService through the gateway. Set by the InferenceService path; empty for ModelRouter (which fronts many model names).

routeReady boolean

RouteReady indicates the AIGatewayRoute (and its backing Backend + AIServiceBackend) were reconciled successfully against the gateway.

lastUpdated string

LastUpdated is the timestamp of the last status update

format: date-time

mode string

Mode is the resolved serving mode (chat, embedding, or rerank): spec.mode when set, otherwise inferred from the runtime flags and endpoint path.

modelReady boolean

ModelReady indicates if the referenced Model is in Ready state

phase string

Phase represents the current lifecycle phase of the InferenceService. Possible values: Pending, Creating, Progressing, Ready, WaitingForGPU, Stopped, Failed. Stopped is the terminal state when spec.replicas=0 has caused the agent to tear down the workload; tooling polling for readiness should treat Stopped the same as Pending (the user intentionally took the service offline; this is not an error).

enum: Pending, Creating, Progressing, Ready, WaitingForGPU, Stopped, Failed

queuePosition integer

QueuePosition indicates position among pending InferenceServices cluster-wide (0 = not queued)

format: int32

readyReplicas integer

ReadyReplicas is the number of inference pods that are ready; compare with DesiredReplicas to see how many are still pending

format: int32

replicas integer

Replicas is the current number of running inference pods

format: int32

schedulingMessage string

SchedulingMessage provides details about scheduling issues

schedulingStatus string

SchedulingStatus indicates why pods cannot be scheduled (e.g., "InsufficientGPU")

waitingFor string

WaitingFor describes the resource constraint (e.g., "nvidia.com/gpu: 1")
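
For orientation, the status of a service that is queued for a GPU might look something like the following. Every value here is made up for illustration (including the condition reason, message, and timestamp); only the field names, the phase enum value, and the example strings for schedulingStatus and waitingFor come from the descriptions above.

```yaml
# Illustrative status only; all values are hypothetical.
status:
  phase: WaitingForGPU
  modelReady: true
  desiredReplicas: 1
  replicas: 0
  readyReplicas: 0
  queuePosition: 2                  # 0 would mean not queued
  schedulingStatus: InsufficientGPU
  schedulingMessage: "no node has a free GPU"   # hypothetical message
  waitingFor: "nvidia.com/gpu: 1"
  conditions:
    - type: Available
      status: "False"
      reason: InsufficientGPU       # hypothetical reason value
      message: "waiting for GPU capacity"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```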
