| Property | Value |
|---|---|
| Provider | Mistral AI |
| GGUF architecture key | mistral3 |
| Source class | Mistral3Model |
| Vision encoder | Mistral3VisionEncoder (Pixtral) |
| Image processor | Mistral3ImageProcessor |
| Example models | Mistral-Small-3.1-24B-Instruct |
| Modalities | Text, image |
| Thinking mode | No |
| Tool calling | No |
| Output parser | PassthroughOutputParser |
Mistral 3 is Mistral's third-generation LLaMA-style dense transformer with three distinguishing extensions:
- YaRN-corrected RoPE — partial, per-frequency interpolation between the original RoPE schedule and an extrapolated one. This lets the model attend cleanly across context lengths well past the original training context.
- Position-dependent Q scaling — once the position exceeds the original context length, Q is multiplied by `1 + β · log(1 + ⌊pos / origCtx⌋)`, which keeps attention magnitudes well-conditioned at long ranges (worked example below).
- Pixtral vision encoder for image input.
Otherwise the architecture is the standard LLaMA / Mistral recipe: GQA, no QK-norm, SwiGLU FFN, RMSNorm, GPT-J style RoPE pairings, optional tied LM head.
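To get a rough feel for the Q-scaling schedule, here is a minimal sketch, assuming natural log, the default β = 0.1 from the metadata table below, and an illustrative 32k original context:

```csharp
using System;

class QScaleDemo
{
    // scale(pos) = 1 + β · ln(1 + ⌊pos / origCtx⌋); β defaults to 0.1
    // (mistral3.attention.temperature_scale). origCtx here is illustrative.
    static double QScale(long pos, long origCtx, double beta = 0.1) =>
        1.0 + beta * Math.Log(1 + pos / origCtx);   // integer division == floor

    static void Main()
    {
        const long origCtx = 32_768;
        foreach (long pos in new long[] { 1_000, 40_000, 100_000, 500_000 })
            Console.WriteLine($"pos={pos,7}: Q scale = {QScale(pos, origCtx):F4}");
        // Below origCtx the scale is exactly 1; past it, growth is logarithmic,
        // e.g. pos = 100_000 → 1 + 0.1·ln(4) ≈ 1.139.
    }
}
```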
tokens (int[])
│
token_embd.weight
│
[optional] InjectVisionEmbeddings (image tokens)
│
┌──── × NumLayers ──────────────────────────────┐
│ RMSNorm(attn_norm) │
│ Q, K, V projections (or fused QKV) │
│ RoPE_GPT-J(Q, K) with YaRN-corrected freqs │
│ Q ← Q * (1 + β · log(1 + ⌊pos / origCtx⌋)) │
│ attention(Q, KCache, VCache) (full causal) │
│ attn_output ─► residual │
│ RMSNorm(ffn_norm) │
│ ffn_gate_up ─► [gate ‖ up] │
│ SwiGLU: SiLU(gate) * up │
│ ffn_down ─► residual │
└────────────────────────────────────────────────┘
│
RMSNorm(output_norm)
│
LM head (output.weight or tied)
│
▼
logits
For each layer L:
hidden ─► RMSNorm(attn_norm.weight, eps)
─► QKV (one fused attn_qkv.weight matmul when fused at load time,
    otherwise three separate attn_q/k/v matmuls)
─► RoPE_GPT-J(Q, K, ropeFreqs[]) at positions startPos..startPos+seqLen-1
• GPT-J style: pairs adjacent elements (x[2i], x[2i+1])
• Frequencies are YaRN-corrected when ropeType == "yarn"
─► [if _ropeOrigCtx > 0]:
scale = 1 + _ropeScalingBeta * log(1 + floor(pos / _ropeOrigCtx))
Q ← Q * scale
─► Q ← Q * (1/sqrt(headDim))
─► append (K, V) to KV cache
─► attention(Q, KCache, VCache) # standard causal GQA
─► attn_output.weight matmul → o
─► hidden = hidden + o
─► h2 = RMSNorm(ffn_norm.weight, eps) on hidden
─► ffn_gate_up.weight matmul → [gate ‖ up]
─► g = SiLU(gate)
─► h3 = ffn_down.weight × (g * up)
─► hidden = hidden + h3
After all layers:
hidden ─► narrow(seq_len-1) if prefill
─► RMSNorm(output_norm.weight, eps)
─► matmul against output.weight (or token_embd.weight when tied)
─► copy to float[VocabSize]
- GQA with `Config.NumKVHeads < Config.NumHeads`. Separate `key_length` and `value_length` are supported (`_attnKeyLen`, `_attnValLen`); they default to `Config.HeadDim` when not set.
- Optional fused QKV. `FuseQKVWeights()` builds `attn_qkv.weight` when the GGUF stores Q / K / V separately. The decoder uses `_layerQkvFused[l]` to pick the fused matmul vs. the three separate matmuls per layer (a shape-level sketch follows this list).
- No QK-norm — unlike Qwen 3, Gemma 3, and Gemma 4, Mistral 3 skips the per-head RMSNorm of Q and K.
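A shape-level sketch of what the fusion amounts to; the jagged `float[][]` layout is an assumption for illustration, while the real `FuseQKVWeights()` operates on quantized backend tensors:

```csharp
static class QkvFusionSketch
{
    // Row-concatenate Q, K, V projection matrices into one
    // [qRows + kRows + vRows, hidden] matrix so the decoder can issue a
    // single matmul per layer instead of three.
    public static float[][] Fuse(float[][] q, float[][] k, float[][] v)
    {
        var fused = new float[q.Length + k.Length + v.Length][];
        int row = 0;
        foreach (var block in new[] { q, k, v })
            foreach (var r in block)
                fused[row++] = r;   // all rows share the same input dim (hidden)
        return fused;
    }
}
```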
- Pairing: adjacent elements `(x[2i], x[2i+1])` are rotated together (GPT-J convention), as opposed to the `(x[i], x[i + halfDim])` pairing of NeoX.
- YaRN frequency correction: when `mistral3.rope.scaling.type == "yarn"` and `mistral3.rope.scaling.original_context_length > 0`, `ApplyYarnFreqCorrection()` interpolates between extrapolated and interpolated frequencies based on whether each frequency band is in the "slow" or "fast" rotation range (a C# transcription follows this list):

  ```
  lowFreqWavelen  = origCtx / betaSlow
  highFreqWavelen = origCtx / betaFast
  for each freq f:
      wavelen = 2π / f
      if wavelen < highFreqWavelen:    # high-freq band: extrapolate
          f stays
      elif wavelen > lowFreqWavelen:   # low-freq band: interpolate
          f *= 1 / scale
      else:                            # medium band: smooth interp
          ramp = ...                   # linear ramp between the two
          f = mix(f, f / scale, ramp)
  ```

- Position-dependent Q scaling: when `_ropeOrigCtx > 0`, `q *= 1 + _ropeScalingBeta * log(1 + floor(pos / _ropeOrigCtx))`. This keeps attention magnitudes well-conditioned past the original context length.
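A C# transcription of the band logic above — a sketch, not the project's `ApplyYarnFreqCorrection()`; the exact ramp formula (linear in wavelength) is an assumption, and `scale` is the context-extension factor:

```csharp
using System;

static class YarnSketch
{
    // Sketch of the YaRN band correction. The wavelength-linear ramp is an
    // assumption; the project's ApplyYarnFreqCorrection() may differ in detail.
    public static void Correct(double[] freqs, double origCtx, double scale,
                               double betaFast = 32.0, double betaSlow = 1.0)
    {
        double lowFreqWavelen  = origCtx / betaSlow;   // boundary of the slow band
        double highFreqWavelen = origCtx / betaFast;   // boundary of the fast band
        for (int i = 0; i < freqs.Length; i++)
        {
            double wavelen = 2 * Math.PI / freqs[i];
            if (wavelen < highFreqWavelen) continue;   // fast band: extrapolate (keep f)
            if (wavelen > lowFreqWavelen) { freqs[i] /= scale; continue; } // slow band
            // Medium band: blend linearly between "keep f" and "f / scale".
            double ramp = (wavelen - highFreqWavelen) / (lowFreqWavelen - highFreqWavelen);
            freqs[i] = freqs[i] * (1 - ramp) + freqs[i] / scale * ramp;
        }
    }
}
```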
`ffn_gate_up.weight` packs `[gate ‖ up]`. `Ops.SiLUMul` then computes `SiLU(gate) * up`, followed by `ffn_down`. No biases.
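In scalar form the packed activation amounts to the following sketch; the real `Ops.SiLUMul` runs as a native kernel over backend tensors:

```csharp
using System;

static class SiluMulSketch
{
    // gateUp packs [gate ‖ up]: gateUp[0..n) is gate, gateUp[n..2n) is up.
    public static float[] SiluMul(float[] gateUp)
    {
        int n = gateUp.Length / 2;
        var result = new float[n];
        for (int i = 0; i < n; i++)
        {
            float g = gateUp[i];
            float silu = g / (1f + MathF.Exp(-g));   // SiLU(g) = g · sigmoid(g)
            result[i] = silu * gateUp[n + i];        // SiLU(gate) * up
        }
        return result;
    }
}
```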
`token_embd.weight` is the input embedding. The LM head uses `output.weight` when present; otherwise it's tied to `token_embd.weight` (`_hasTiedOutput`).
Mistral3VisionEncoder is the Pixtral architecture:
- Conv2D patch embedding: `v.patch_conv.weight` (shape `[hidden, channels, patchSize, patchSize]`) with optional bias.
- RMSNorm at the encoder input.
- 2D RoPE positional embedding for spatial position.
- SiLU-gated MLP transformer blocks: LayerNorm + multi-head attention + residual + LayerNorm + SiLU-gated MLP + residual.
- Spatial patch merging: groups neighboring patches into one merged token at the projector boundary (sketched after this list).
- Multimodal projector: RMSNorm → PatchMerger → Linear → GELU → Linear, mapping to the LM hidden dim.
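What the merging step does spatially, as a sketch — the merge factor, row-major traversal order, and `float[][]` layout are assumptions; the concatenated tokens are then projected back down by `mm.patch_merger.merging_layer.weight`:

```csharp
using System;

static class PatchMergerSketch
{
    // Each merge×merge block of patch embeddings is concatenated into one
    // merged token of merge² times the width. Real code operates on backend
    // tensors; float[] here is illustrative.
    public static float[][] Merge(float[][] grid, int rows, int cols, int dim, int merge = 2)
    {
        var merged = new float[(rows / merge) * (cols / merge)][];
        int t = 0;
        for (int r = 0; r < rows; r += merge)
            for (int c = 0; c < cols; c += merge)
            {
                var tok = new float[dim * merge * merge];
                int offset = 0;
                for (int dr = 0; dr < merge; dr++)
                    for (int dc = 0; dc < merge; dc++)
                    {
                        Array.Copy(grid[(r + dr) * cols + (c + dc)], 0, tok, offset, dim);
                        offset += dim;
                    }
                merged[t++] = tok;   // later reduced by the merging layer's linear
            }
        return merged;
    }
}
```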
Mistral3ImageProcessor resizes images preserving aspect ratio:
- Composite RGBA over a white background.
- Resize so the longest edge equals `longest_edge` while keeping aspect ratio (see the sketch after this list).
- Pad to a multiple of `patch_size`.
- Normalize with the CLIP default mean / std.
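The resize/pad arithmetic, sketched with illustrative constants; the actual `longest_edge` and `patch_size` values come from the processor config:

```csharp
using System;

static class ResizeSketch
{
    // Returns the padded target size: scale so the longest edge equals
    // longestEdge (aspect ratio preserved), then pad each side up to a
    // multiple of patchSize. Default constants are illustrative.
    public static (int w, int h) TargetSize(int w, int h,
                                            int longestEdge = 1024, int patchSize = 16)
    {
        double scale = (double)longestEdge / Math.Max(w, h);
        int rw = (int)Math.Round(w * scale);
        int rh = (int)Math.Round(h * scale);
        int PadUp(int x) => (x + patchSize - 1) / patchSize * patchSize;
        return (PadUp(rw), PadUp(rh));
    }
}
```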
The multimodal injector (`ProcessMistral3History` in `ModelMultimodalInjector.cs`) walks the chat history, runs the encoder once per unique image, and queues a `PreparedEmbeddingSpan` so the embeddings land at the right positions in the prompt before `Forward()`.
| Key | Type | Meaning |
|---|---|---|
| `mistral3.rope.dimension_count` | uint32 | RoPE dimension count |
| `mistral3.rope.scaling.type` | string | RoPE scaling type (e.g. `yarn`) |
| `mistral3.attention.temperature_scale` | float32 | Position-dependent Q scaling β (default 0.1) |
| `mistral3.rope.scaling.original_context_length` | uint32 | YaRN original context length |
| `mistral3.rope.scaling.extrapolation_factor` | float32 | YaRN extrapolation factor (default 1.0) |
| `mistral3.rope.scaling.yarn_beta_fast` | float32 | YaRN fast rotation threshold (default 32.0) |
| `mistral3.rope.scaling.yarn_beta_slow` | float32 | YaRN slow rotation threshold (default 1.0) |
| `mistral3.rope.scaling.mscale` | float32 | YaRN mscale (default 0) |
| `mistral3.rope.scaling.mscale_all_dim` | float32 | YaRN mscale_all_dim (default 0) |
For the Pixtral vision projector (mmproj GGUF), the standard `clip.vision.*` keys cover the encoder dims, plus `mm.*` keys for the projector layers.
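For illustration, reading these keys with their documented defaults might look like this; the `IGgufMetadata` accessor is a hypothetical stand-in for the project's own metadata parser:

```csharp
// Hypothetical metadata accessor; the real project has its own GGUF parser.
interface IGgufMetadata
{
    string GetString(string key, string fallback);
    uint   GetUInt32(string key, uint fallback);
    float  GetFloat(string key, float fallback);
}

sealed class Mistral3RopeConfig
{
    public string RopeType = "";
    public uint   OrigCtx;
    public float  Beta, BetaFast, BetaSlow;

    // Key names and defaults come from the metadata table above.
    public static Mistral3RopeConfig Read(IGgufMetadata m) => new Mistral3RopeConfig
    {
        RopeType = m.GetString("mistral3.rope.scaling.type", ""),
        OrigCtx  = m.GetUInt32("mistral3.rope.scaling.original_context_length", 0),
        Beta     = m.GetFloat("mistral3.attention.temperature_scale", 0.1f),
        BetaFast = m.GetFloat("mistral3.rope.scaling.yarn_beta_fast", 32.0f),
        BetaSlow = m.GetFloat("mistral3.rope.scaling.yarn_beta_slow", 1.0f),
    };
}
```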
token_embd.weight # [vocab, hidden]
blk.{L}.attn_norm.weight # pre-attention RMSNorm
blk.{L}.attn_q.weight # Q projection (before fusion)
blk.{L}.attn_k.weight # K projection (before fusion)
blk.{L}.attn_v.weight # V projection (before fusion)
blk.{L}.attn_qkv.weight # fused Q+K+V (after fusion)
blk.{L}.attn_output.weight # output projection
blk.{L}.ffn_norm.weight # pre-FFN RMSNorm
blk.{L}.ffn_gate.weight } # before fusion
blk.{L}.ffn_up.weight }
blk.{L}.ffn_gate_up.weight # after fusion: [2*intermediate, hidden]
blk.{L}.ffn_down.weight # FFN down projection
output_norm.weight # final RMSNorm
output.weight # LM head (optional if tied)
v.patch_conv.weight # Conv2D patch embedding [hidden, C, P, P]
v.patch_conv.bias # Conv2D bias (optional)
v.encoder_norm.weight # encoder input RMSNorm
v.blk.{L}.attn_norm.weight # pre-attention RMSNorm
v.blk.{L}.attn_q.weight # Q projection
v.blk.{L}.attn_k.weight # K projection
v.blk.{L}.attn_v.weight # V projection
v.blk.{L}.attn_output.weight # output projection
v.blk.{L}.ffn_norm.weight # pre-FFN RMSNorm
v.blk.{L}.ffn_gate.weight # SiLU gate
v.blk.{L}.ffn_up.weight # up projection
v.blk.{L}.ffn_down.weight # down projection
mm.norm.weight # projector RMSNorm
mm.patch_merger.merging_layer.weight # spatial patch merger
mm.linear_1.weight # projector linear 1
mm.linear_2.weight # projector linear 2
Constructor (`Mistral3Model(string ggufPath, BackendType backend)`):

- `ParseBaseConfig()`:
  - Reads `_attnKeyLen` / `_attnValLen` from `KeyLength` / `ValueLength` (or defaults to `HeadDim`). Reads `_ropeDim`.
  - Reads YaRN parameters.
- `ParseTokenizer()`, `LoadWeights()`, `FuseQKVWeights()`, `FuseGateUpWeights()`, `PrepareCudaQuantizedWeightsForInference()`.
- `InitKVCache(maxSeqLen)` allocates per-layer K and V tensors of shape `[NumKVHeads, maxSeqLen, headDim]` (separate `_attnKeyLen` / `_attnValLen` are honored).
- `PrecomputeConstants()` builds:
  - The per-layer weight name arrays — different shapes depending on whether QKV / gate-up are fused for that layer.
  - The RoPE frequency table `_ropeFreqs[halfDim]`.
  - YaRN-corrected frequencies via `ApplyYarnFreqCorrection()` when `_ropeType == "yarn"` and `_ropeOrigCtx > 0`.
`Forward(int[] tokens)` runs the per-op managed loop:

- Embedding lookup.
- Optional vision injection at `<image_pad>`-marked positions.
- For each layer: RMSNorm, QKV (fused or split), RoPE with YaRN-corrected frequencies, optional position-dependent Q scaling, attention, output projection + residual, FFN, residual.
- For prefill, the residual is narrowed to the last token before `output_norm` so the LM head matmul produces a single row.
- Final RMSNorm, LM head, copy to `_logitsBuffer`.
- Fused QKV and gate / up cut the per-layer GEMM count.
- Last-row narrow before `output_norm` so the LM head matmul only produces `[1, vocab]` (illustrated below).
- YaRN frequency table built once at load time. The decode path uses the same `_ropeFreqs[]` array.
- Backend ops dispatch through `Ops.RoPEEx` and `Ops.SiLUMul`, which run as native kernels on GGML / CUDA backends.
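The shape arithmetic behind the last-row narrow, with illustrative sizes (not the real model dims):

```csharp
using System;

class NarrowDemo
{
    static void Main()
    {
        // The LM head matmul is [rows, hidden] × [hidden, vocab]. During
        // prefill only the last token's logits are needed, so narrowing to
        // one row turns a promptLen-row GEMM into a single-row GEMV.
        long hidden = 5120, vocab = 131_072, promptLen = 2048;   // illustrative
        long fullMacs   = promptLen * hidden * vocab;
        long narrowMacs = 1 * hidden * vocab;
        Console.WriteLine($"full={fullMacs:N0} MACs, narrowed={narrowMacs:N0} ({promptLen}x fewer)");
    }
}
```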
There is no chunked prefill (full attention only — no SWA), no fused per-layer prefill kernel, and no native whole-model decode for Mistral 3. Adding any of these would shrink decode latency and TTFT.
- Pre-resolved per-layer weight name arrays (`_layerWeightNames[L][]`) allocated in `PrecomputeConstants()` keep the hot loop allocation-free.
- The `_layerQkvFused[]` flag picks the fused vs. split QKV path per layer without allocations.
- Pre-allocated decode position arrays and YaRN-corrected `_ropeFreqs[]` avoid recomputation per token.
- Quantized weights stay quantized in `_quantWeights`; matmul calls run through the backend's native quantized matmul.
- Per-layer K and V tensors of shape `[NumKVHeads, maxSeqLen, headDim]`, using `_attnKeyLen` / `_attnValLen` for the per-head dim (sketched below).
- `ResetKVCache()` zeroes everything and calls `InvalidateTensorDeviceCache()` to sync GPU state.
- `TruncateKVCache(int tokenCount)` is supported (used for multi-turn KV reuse on the server).
- The Pixtral vision encoder dequantizes its weights to F32 at load time; the LM weights stay quantized when supported by the backend.
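A bookkeeping-level sketch of the cache layout and truncation; the flat `float[]` backing is an assumption, since the real tensors may be quantized or device-resident:

```csharp
using System;

sealed class KvCacheSketch
{
    // One K and one V buffer per layer, each holding
    // [NumKVHeads, maxSeqLen, headDim] elements flattened.
    readonly float[][] _k, _v;
    int _tokenCount;   // number of valid cached positions

    public KvCacheSketch(int layers, int numKvHeads, int maxSeqLen, int headDim)
    {
        _k = new float[layers][];
        _v = new float[layers][];
        for (int l = 0; l < layers; l++)
        {
            _k[l] = new float[numKvHeads * maxSeqLen * headDim];
            _v[l] = new float[numKvHeads * maxSeqLen * headDim];
        }
    }

    // Multi-turn reuse: keep the first tokenCount positions, drop the rest.
    public void Truncate(int tokenCount) => _tokenCount = Math.Min(_tokenCount, tokenCount);

    // Real code also zeroes the buffers and invalidates device copies.
    public void Reset() => _tokenCount = 0;
}
```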
- `PassthroughOutputParser` — Mistral 3 has no thinking / tool-call wire format.
- Chat template uses Mistral's standard chat format (`<s>[INST]...[/INST]...</s>`). Falls back to the hardcoded template when the GGUF lacks a Jinja2 template.
- The image placeholder is `<image_pad>`, and `ChatTemplate.ExpandImageTokens` expands one `<image_pad>` into the right number of placeholder tokens for the corresponding image's encoded length.
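A sketch of that expansion step; the token ids and the source of the per-image counts are hypothetical, not the project's `ChatTemplate.ExpandImageTokens` signature:

```csharp
using System.Collections.Generic;
using System.Linq;

static class ImageTokenExpansionSketch
{
    // One <image_pad> placeholder per image in the prompt becomes N copies,
    // where N is how many embeddings the encoder produced for that image.
    public static List<int> Expand(IReadOnlyList<int> prompt, int imagePadId,
                                   IReadOnlyList<int> perImageTokenCounts)
    {
        var expanded = new List<int>(prompt.Count);
        int img = 0;
        foreach (int tok in prompt)
        {
            if (tok == imagePadId)
                expanded.AddRange(Enumerable.Repeat(imagePadId, perImageTokenCounts[img++]));
            else
                expanded.Add(tok);
        }
        return expanded;
    }
}
```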
- Native whole-model decode — the entire forward pass is managed C# with backend-dispatched matmul. A native single-call decode path (analogous to Qwen 3's `TransformerModelDecode`) would remove most of the managed overhead.
- Fused single-graph decode — a `Mistral3ModelDecode` GGML kernel would significantly improve Metal / CUDA throughput by collapsing the per-layer dispatches.
- Quantized vision encoder — Pixtral weights are currently dequantized to F32 at load time, increasing memory usage for image-heavy workloads. Supporting quantized vision weights directly would shrink the resident set and cut load time.
- Fused prefill attention — adopting Qwen 3.5's `FusedPrefillAttention` approach would reduce per-layer prefill round-trips on GGML / CUDA.