Performance & Benchmarks
Here’s a summary of relevant publicly available benchmarks covering AMX/AVX (and Arm SME) CPU inference and how it compares with GPUs.
OpenMetal: “Intel AMX Enables High-Efficiency CPU Inference for AI Workloads”
Hardware: 4th-/5th-gen Intel Xeon with AMX
Workload: Llama 3.2 3B, quantized (Q4_K_M or Q8_0, i.e. 4-bit and 8-bit formats)
Results: with AMX enabled, up to ~57 tokens/sec for Llama 3.2 3B at Q8_0 vs ~28 t/s with AMX disabled; with 4-bit quantization, up to ~80 t/s in some configurations. (OpenMetal IaaS)
Caveats: these are modest-size models, and input vs output lengths, batch sizes, etc. all matter. Time to first token is also worse relative to GPUs. This is an “inference on CPU, quantized, AMX enabled vs disabled” comparison, not a direct comparison against, say, an H100 or A100.
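As a quick sanity check when reproducing numbers like these, here is a minimal sketch (assuming a Linux host) that reports whether the kernel exposes the AMX feature flags at all, so you know whether an “AMX enabled” run is even possible on the machine:

```python
# Minimal sketch, assuming a Linux host: list the AMX feature flags the kernel
# advertises. 4th-/5th-gen Xeon typically reports amx_tile, amx_int8, amx_bf16.
from pathlib import Path

def amx_flags() -> set[str]:
    flags: set[str] = set()
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            flags.update(tok for tok in line.split() if tok.startswith("amx"))
    return flags

if __name__ == "__main__":
    found = sorted(amx_flags())
    print("AMX flags:", ", ".join(found) if found else "none (expect the slower AVX-only path)")
```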
Presidio blog: “LLMs on Intel Xeon CPUs with Intel AMX”
Hardware: AWS EC2 m7i.8xlarge (4th-gen Intel Xeon with AMX) vs an AWS GPU instance (p3.2xlarge, a single NVIDIA V100)
Workload: “Neural Chat”-type LLMs; generic prompts vs RAG (retrieval-augmented) prompts; INT8 quantization on the CPU + AMX side
Results: for generic prompts, CPU + AMX achieved ~100 t/s (tokens/sec) vs roughly ~25-27 t/s without AMX; for RAG prompts, ~120 t/s with AMX vs ~35-40 t/s without. The GPU instance still has higher throughput and lower latency, but the gap is smaller for quantized/INT8 models. (Presidio)
Caveats: RAG prompts include large input contexts, which stresses memory; test sizes and prompt lengths differ between runs; and first-token latency on CPU tends to be higher. Model sizes here are in the tens of billions of parameters or smaller (for 70-405B-class models, GPUs pull further ahead).
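When comparing throughput figures like these, it matters what exactly is timed. Below is a hedged sketch of how tokens/sec and time-to-first-token are typically separated; `generate_stream` is a hypothetical stand-in for whichever backend (CPU + AMX or GPU) streams tokens back:

```python
# Sketch of a throughput/latency measurement: time-to-first-token (prefill)
# versus decode tokens/sec. `generate_stream` is a hypothetical callable that
# yields generated tokens one at a time.
import time
from typing import Callable, Iterable

def benchmark(generate_stream: Callable[[str], Iterable[str]], prompt: str) -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in generate_stream(prompt):      # consume the token stream
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1e3 if first_token_at else float("nan")
    decode_tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else float("nan")
    return {"time_to_first_token_ms": ttft_ms, "decode_tokens_per_sec": decode_tps}
```

Reporting decode throughput separately from time to first token keeps prefill-heavy cases (such as the RAG prompts above) from muddying the comparison.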
SemiAnalysis: “AMD vs NVIDIA Inference Benchmark: Who Wins?”
Hardware: AMD MI300X / MI325X vs NVIDIA H100 / H200
Workload: large dense models such as Llama 3 405B; FP16 or FP8 precision in many cases; throughput measured under specific latency constraints
Results: MI300X outperforms H200 (when the latter runs a less-optimized stack) and H100 in some “large-model, memory-bound” scenarios; MI325X beats H100 and H200 in several cases. Per-GPU throughput runs from hundreds to ~1,000 tokens/sec under the tested latency constraints. (SemiAnalysis)
Caveats: this is purely GPU vs GPU; useful for understanding GPU scaling, but it does not compare against CPU/AMX/SME.
Arm SME2 (Lumex CSS / C1 CPU cluster)
Hardware: SME2-enabled Armv9.3 CPUs (C1 cluster)
Workload: on-device AI tasks such as audio generation and speech
Results: up to 5× uplift in AI performance over prior-generation CPUs; ~2.8× faster audio generation; ~4.7× lower latency for speech workloads. (Arm Newsroom)
Caveats: these figures are not for large-LLM (>10-100B) inference; they cover on-device / mobile / sub-flagship-CPU tasks. There is no direct comparison with server GPUs, and “AI performance” is loosely defined (it may mix tasks simpler than full transformer decoding).
“Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension”
Hardware: Apple M4 chip (with SME)
Workload: microbenchmarks of small matrix multiplications; FP32 (and possibly lower-precision formats)
Results: SME on the M4 achieved over 2.3 FP32 TFLOPS for certain matrix sizes, and the SME kernels beat vendor BLAS implementations for small matrices in most tested configurations. (arXiv)
Caveats: microbenchmarks on small matrices are good indicators for the matmul/attention building blocks, but they don’t fully reflect whole-model inference (which also includes softmax, layer norm, tokenization, etc.). Precision also matters: FP32 vs quantized INT8 / BF16, etc.
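For context, the arithmetic behind a figure like “2.3 FP32 TFLOPS” is just floating-point operations divided by time: an M×K by K×N GEMM performs roughly 2·M·N·K FLOPs. A small sketch that illustrates the calculation by timing NumPy’s BLAS (not SME itself):

```python
# Sketch of the TFLOPS arithmetic used in matmul microbenchmarks:
# TFLOPS = 2*M*N*K*repeats / (elapsed_seconds * 1e12).
import time
import numpy as np

def matmul_tflops(m: int, k: int, n: int, repeats: int = 50) -> float:
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b                                   # warm-up run, result discarded
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start
    return (2.0 * m * n * k * repeats) / (elapsed * 1e12)

print(f"{matmul_tflops(512, 512, 512):.3f} FP32 TFLOPS (NumPy BLAS, not SME)")
```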
Comparing CPU (AMX / AVX) vs GPU — Magnitude of Gaps
Putting together what’s known:
For small-to-medium models (say ~1-10B parameters), with quantization (INT8 or lower-bit formats) on CPUs with AMX/AVX, token throughput can reach tens to low hundreds of tokens/sec. GPUs, with batching, reach thousands of tokens/sec on those same models.
From Presidio: CPU with AMX + INT8 reached ~100 t/s on generic prompts; the exact GPU figure depends on the instance used, but the GPU was still faster and lower latency. (Presidio)
In the OpenMetal case, enabling AMX roughly doubles throughput versus leaving it disabled, so the CPU “wins” in that optimization sense, but it does not close the gap to high-end GPUs for large models. (OpenMetal IaaS)
SME2 on mobile/on-device hardware delivers large multipliers over the previous CPU generation, but it is still not in the same throughput class as server-grade GPUs, especially for larger LLMs.
Expected Gaps / Where GPUs Still Win
Based on the data and what we know of hardware:
Throughput (tokens/sec): GPUs still dominate, especially as model size grows, when precision is FP16/BF16 or even FP8, and when batch size or context length is large.
Latency to First Token: GPUs tend to have much lower “warm-up” and can start generating quickly. CPUs carry higher overheads, especially on the first token with long contexts or small batches.
Memory size & bandwidth: GPU memory (HBM, large VRAM) is a big advantage for large models; CPUs are often constrained by system memory bandwidth and cache latency when streaming large model weights, attention (KV-cache) state, etc. (a back-of-envelope sketch follows this list).
Cost Efficiency at Lower Scales: CPUs with AMX (or AVX) become more competitive when model sizes are modest, quantization is used, and throughput requirements are moderate. They may even beat GPUs on owned-and-operated server TCO once GPU hardware, power, cooling, etc. are factored in.
Energy use / power draw: for on-device or low-batch settings, CPUs may be more efficient per watt than powering large GPUs, but for sustained high throughput GPUs usually come out ahead in joules per token.
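The memory-bandwidth point supports a useful back-of-envelope bound: in single-stream decoding, each generated token must stream roughly all model weights through memory once, so decode speed is capped at bandwidth divided by weight bytes. A sketch with assumed, illustrative bandwidth figures (not measured values):

```python
# Back-of-envelope upper bound: decode tokens/sec <= memory bandwidth / bytes of
# weights read per token. Bandwidth numbers are illustrative assumptions; real
# systems land below this bound due to KV-cache traffic, compute, and overheads.
def decode_upper_bound(params_billion: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

for label, bw in [("8-channel DDR5 Xeon (~300 GB/s, assumed)", 300),
                  ("H100 SXM HBM3 (~3350 GB/s)", 3350)]:
    # 70B model at INT8 (1 byte/param) vs BF16 (2 bytes/param)
    print(f"{label}: "
          f"70B INT8 <= {decode_upper_bound(70, 1, bw):.1f} t/s, "
          f"70B BF16 <= {decode_upper_bound(70, 2, bw):.1f} t/s")
```

The bound scales inversely with weight bytes, which is why the CPU-GPU gap widens with model size: HBM bandwidth sits roughly an order of magnitude above typical server DDR5.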
Specific Approximate Comparisons / “Rule of Thumb” Based on Available Data
Here are some rough numbers / heuristics extrapolated from the sources above + known GPU performance:
Small model (~3B parameters), quantized INT8 or 4-bit, small batch:
~50-100 t/s achievable on Xeon + AMX with quantization (OpenMetal IaaS)
Likely thousands of tokens/sec in aggregate on a good GPU (with batching); easily 10×-30× faster in these scenarios
Medium model (~20-70B), FP16/BF16 or quantized, long context, high batch:
CPUs often struggle here: perhaps tens of t/s at best, with memory-capacity and bandwidth constraints on top, while GPUs reach hundreds to thousands of t/s depending on the stack and precision
GPUs shine here; H100 / MI300X and similar cards scale to longer contexts and larger batch sizes with much better throughput
First token latency:
CPU + AMX might deliver the first token in hundreds of ms (sometimes under 200 ms on good quantized paths) for small/medium models; RAG or large contexts push that higher. Presidio reports under ~50 ms to first token in good setups for “generic prompts” with AMX + quantization. (Presidio)
GPUs typically deliver the first token in tens of ms in many production settings (with efficient I/O and an optimized stack)
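To turn these rules of thumb into user-perceived latency, combine time to first token with the decode rate. A quick sketch using illustrative values in the ranges quoted above (assumptions, not measurements):

```python
# End-to-end response latency ~= time-to-first-token + (output tokens - 1) / decode rate.
# The TTFT and tokens/sec values below are illustrative assumptions, not measured results.
def response_latency_s(ttft_ms: float, decode_tps: float, output_tokens: int) -> float:
    return ttft_ms / 1e3 + max(output_tokens - 1, 0) / decode_tps

for label, ttft_ms, tps in [("Xeon + AMX, ~3B INT8 (assumed)", 200, 60),
                            ("Server GPU, ~3B model (assumed)", 40, 600)]:
    print(f"{label}: 256-token reply in ~{response_latency_s(ttft_ms, tps, 256):.2f} s")
```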
Gaps in Public Data (Open Points)
I didn’t find many direct head-to-head benchmarks of AMX CPUs vs top GPUs on large LLMs (70-400B) with the same quantization, batch size, and prompt size, so there is no widely published, fully apples-to-apples gap.
Similarly, SME / SME2 benchmarks for LLM inference (especially large models) are scarce; most public SME2 results are mobile-/on-device-oriented or are matrix-multiplication microbenchmarks, not full-model decoding/generation.
Quantization formats, use of sparse operators, and similar choices also make a huge difference. Some CPU benchmarks use more aggressive quantization or pruning than the GPU benchmarks they are compared against, which skews what a given tokens/sec figure means in terms of accuracy.
Takeaways: How Much GPUs Beat CPUs (AMX/AVX/SME) and Where CPUs Are Catching Up
Putting this all together:
For large models (say >50-100B parameters) and workloads that need high throughput or many concurrent requests, GPUs remain the practical choice. The performance gaps are still large, often 5× to 30× or more depending on configuration.
For small or medium models (1-20B), especially quantized ones, CPUs with AMX can close the gap significantly, sometimes to within 2×-5× of GPU throughput (depending on model, precision, batch size, etc.), particularly when carefully optimized.
On latency and cost per model served in low-throughput or intermittent settings, CPU + AMX / AVX / SME may be more cost-effective: you pay less for hardware, power, and cooling, and potentially carry less engineering overhead for GPU fleet management.
SME2 on the mobile/device side is making strides, but it is not yet a replacement for server-GPU performance on large LLMs.