Model
inference.llmkube.dev / v1alpha1
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example
apiVersion
string
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind
string
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata
object
spec object required
spec defines the desired state of Model
format
string
Format specifies the model file format.
"gguf" is used with the llama-server runtime; "mlx" is used with the oMLX runtime;
"safetensors", "pytorch", and "custom" are used with the generic runtime.
enum:
gguf, mlx, safetensors, pytorch, custom
hardware object
Hardware specifies hardware acceleration preferences
accelerator
string
Accelerator specifies the type of hardware acceleration
enum:
cpu, metal, cuda, rocm, intel
gpu object
GPU specifies GPU device requirements
count
integer
Count specifies the number of GPUs required
Supports multi-GPU for model sharding (future feature)
format:
int32
minimum:
0
maximum:
8
enabled
boolean
Enabled indicates whether GPU acceleration is enabled
layers
integer
Layers specifies layer offloading for multi-GPU execution.
The value is the number of layers to offload to the GPU (e.g., 32 for full offload on a 7B model).
-1 means auto-detect the optimal layer split.
format:
int32
minimum:
-1
memory
string
Memory specifies minimum GPU memory required per GPU (e.g., "8Gi", "16Gi")
sharding object
Sharding defines how to shard the model across multiple GPUs
Only applicable when Count > 1
layerSplit
[]string
LayerSplit defines custom layer splits per GPU.
Example: ["0-15", "16-31"] for a 2-GPU split of a 32-layer model.
If empty, an even split is calculated automatically.
strategy
string
Strategy defines the sharding approach for multi-GPU model execution.
- "layer" (default): shard by transformer layers. llama.cpp --split-mode layer.
- "tensor" (alias: "row"): true tensor parallelism. llama.cpp --split-mode row.
Splits each tensor operation across GPUs rather than assigning whole layers
to each. Performance varies by workload; typically better on compute-bound ops.
- "none": disable multi-GPU sharding (single GPU). llama.cpp --split-mode none.
- "pipeline": accepted for forward compatibility but currently falls back to
"layer" with a reconciler warning; llama.cpp has no pipeline split-mode.
enum:
layer, tensor, row, pipeline, none
vendor
string
Vendor specifies the preferred GPU vendor (nvidia, amd, intel).
Included to allow for future multi-vendor support.
enum:
nvidia, amd, intel
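The gpu fields above can be combined as in the following sketch of a two-GPU layer split. It is illustrative only: the resource name is hypothetical, the source URL is the placeholder from the source field's examples, and the 32-layer model is an assumption carried over from the layerSplit example.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-multi-gpu        # hypothetical name
spec:
  format: gguf
  source: https://huggingface.co/org/repo/resolve/main/model.gguf  # placeholder URL
  hardware:
    accelerator: cuda
    gpu:
      enabled: true
      count: 2                   # sharding only applies when count > 1
      vendor: nvidia
      memory: "16Gi"             # minimum memory per GPU
      layers: -1                 # auto-detect the layer split
      sharding:
        strategy: layer          # maps to llama.cpp --split-mode layer
        layerSplit:              # optional; assumes a 32-layer model
          - "0-15"
          - "16-31"
```

Leaving layerSplit empty requests an even split, and "pipeline" currently behaves like "layer" with a reconciler warning, so "layer" is the safer explicit choice.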
memoryBudget
string
MemoryBudget is an absolute memory limit for the model process
(e.g., "24Gi", "8192Mi"). When set, it takes precedence over
MemoryFraction and the agent-level --memory-fraction flag.
Parsed via resource.ParseQuantity().
memoryFraction
number
MemoryFraction is the fraction of total system memory to budget for
this model's inference process (0.0–1.0). Takes precedence over the
agent-level --memory-fraction flag but not MemoryBudget.
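As a sketch of the precedence rules above, the manifest below sets both memory fields on an MLX model. The resource name and the specific values are assumptions for illustration; the directory path is taken from the source field's examples.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-mlx              # hypothetical name
spec:
  format: mlx
  source: /mnt/models/Llama-3.2-3B-Instruct-4bit   # MLX model directory
  hardware:
    accelerator: metal
    memoryBudget: "24Gi"         # absolute limit; takes precedence when set
    memoryFraction: 0.5          # overridden here because memoryBudget is set
```

Dropping memoryBudget would make memoryFraction the effective setting, which in turn overrides the agent-level --memory-fraction flag.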
quantization
string
Quantization describes the quantization level (e.g., Q4_0, Q5_K_M, F16)
resources object
Resources defines resource requirements for running the model
cpu
string
CPU specifies CPU requirements (e.g., "2" or "2000m")
memory
string
Memory specifies memory requirements (e.g., "4Gi")
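A minimal sketch of quantization and resources set together, assuming a locally staged GGUF file. The resource name is hypothetical and the values are the examples given in the field descriptions, not recommendations.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-cpu              # hypothetical name
spec:
  format: gguf
  source: file:///mnt/models/model.gguf
  quantization: Q5_K_M           # descriptive label for the quantization level
  resources:
    cpu: "2"                     # equivalent to "2000m"
    memory: "4Gi"
```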
sha256
string
SHA256 is the expected SHA256 hash of the model file for integrity verification.
When set, the controller verifies the downloaded/copied file matches this hash.
pattern:
^[a-fA-F0-9]{64}$
source
string required
Source defines where to obtain the model.
For GGUF models: URL or path to a .gguf file.
For MLX models: local directory path containing the model (config.json, weights).
Supported schemes: http://, https://, file://, pvc://, or absolute paths.
Examples:
- https://huggingface.co/org/repo/resolve/main/model.gguf
- file:///mnt/models/model.gguf
- /mnt/models/model.gguf (air-gapped deployments)
- pvc://my-models-pvc/path/to/model.gguf (pre-staged on a PersistentVolumeClaim)
- /mnt/models/Llama-3.2-3B-Instruct-4bit (MLX model directory)
file:// caveat for hybrid topologies: the controller pod must be
able to read the path. In Mac kind / k3s / GKE deployments where
the metal-agent runs on the host and the controller runs inside a
container, /Users/... and other host paths are invisible to the
controller and will fail to fetch. The controller marks the Model
Failed and backs off to a 5-minute requeue rather than retrying
tightly (#405). Workaround: pre-stage on a pvc://, or use the
equivalent https://huggingface.co/.../<filename>.gguf URL which
the runtime/init container resolves at deploy time.
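The pvc:// workaround described above might look like the following sketch, here paired with the sha256 field for integrity verification. The resource name is hypothetical, the PVC name and path are the placeholders from the examples list, and the hash is a dummy 64-character hex string that must be replaced with the real digest of the file.

```yaml
apiVersion: inference.llmkube.dev/v1alpha1
kind: Model
metadata:
  name: example-prestaged        # hypothetical name
spec:
  format: gguf
  # Pre-staged on a PersistentVolumeClaim, so the controller does not
  # need to read a host path.
  source: pvc://my-models-pvc/path/to/model.gguf
  # Dummy value for illustration only; use the real SHA256 of the file.
  sha256: "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
```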
pattern:
^(https?|file|pvc)://.*|^/[^\s]+$|^[a-zA-Z0-9][\w\-\.\/]+$
status object
status defines the observed state of Model
acceleratorReady
boolean
AcceleratorReady indicates if hardware acceleration is configured and ready
cacheKey
string
CacheKey is the SHA256 hash prefix of the source URL, used for cache storage.
Models with the same source URL share the same cache entry.
conditions []object
conditions represent the current state of the Model resource.
Each condition has a unique type and reflects the status of a specific aspect of the resource.
Standard condition types include:
- "Available": the model is downloaded and ready for use
- "Progressing": the model is being downloaded or processed
- "Degraded": the model download or setup failed
The status of each condition is one of True, False, or Unknown.
lastTransitionTime
string required
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format:
date-time
message
string required
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength:
32768
observedGeneration
integer
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format:
int64
minimum:
0
reason
string required
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
pattern:
^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength:
1
maxLength:
1024
status
string required
status of the condition, one of True, False, Unknown.
enum:
True, False, Unknown
type
string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern:
^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength:
316
gguf object
GGUF contains metadata extracted from the GGUF file header
architecture
string
Architecture is the model architecture (e.g., "llama", "mistral", "phi")
contextLength
integer
ContextLength is the maximum context length (tokens)
format:
int64
embeddingSize
integer
EmbeddingSize is the embedding dimension size
format:
int64
fileVersion
integer
FileVersion is the GGUF file format version
format:
int32
headCount
integer
HeadCount is the number of attention heads
format:
int64
layerCount
integer
LayerCount is the number of transformer layers/blocks
format:
int64
license
string
License is the license identifier extracted from the GGUF file metadata
modelName
string
ModelName is the model name as stored in the GGUF file
quantization
string
Quantization is the quantization type (e.g., "Q4_K_M", "Q5_K_M")
tensorCount
integer
TensorCount is the number of tensors in the model
format:
int64
lastUpdated
string
LastUpdated is the timestamp of the last status update
format:
date-time
path
string
Path represents the local path where the model is stored
phase
string
Phase represents the current lifecycle phase of the model.
Possible values: Pending, Downloading, Copying, Ready, Failed.
enum:
Pending, Downloading, Copying, Ready, Failed
sha256
string
SHA256 is the computed SHA256 hash of the model file.
Populated after download/copy for integrity tracking.
size
string
Size represents the size of the downloaded model file
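For orientation, a populated status block might resemble the sketch below. Every value is invented for illustration (including the reason string, timestamp, path, and GGUF metadata) and is not captured controller output; fields whose exact format is not documented here, such as size and cacheKey, are omitted.

```yaml
status:
  phase: Ready                   # one of Pending, Downloading, Copying, Ready, Failed
  acceleratorReady: true
  path: /mnt/models/model.gguf   # hypothetical local path
  conditions:
    - type: Available
      status: "True"
      reason: ModelReady         # hypothetical reason identifier
      message: ""                # may be an empty string
      lastTransitionTime: "2025-01-01T00:00:00Z"
      observedGeneration: 1
  gguf:
    architecture: llama
    quantization: Q4_K_M
    layerCount: 32
```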