Cost Analysis
1. Dimensions of Cost
CapEx (hardware acquisition):
- GPUs (H100/MI300X): very high upfront cost, scarce supply, long lead times.
- CPUs (Xeon/EPYC/Graviton): abundant, cheaper per socket, widely available across cloud and bare metal.

OpEx (operational cost):
- Power & cooling: GPUs draw 350–700 W per card; CPUs typically 100–300 W per socket.
- Licensing & infra: GPU cloud instances carry premium pricing; CPU instances are commodity-priced.

Developer/engineering cost:
- GPUs require specialized software stacks (CUDA, ROCm, Triton) and serving techniques such as paged attention.
- CPUs leverage mainstream libraries (PyTorch + oneDNN, ONNX Runtime, OpenVINO).
2. Cost per Token (Illustrative)
| Model | Hardware (instance) | Throughput (tokens/s) | Hourly cost | Cost per 1K tokens | Verdict |
|---|---|---|---|---|---|
| Small LLM (3B, INT8) | Xeon AMX (m7i.xlarge) | ~80–100 | ~$0.20/hr | ~$0.0007 | Cheap; CPU competitive |
| Small LLM (3B, INT8) | A10 GPU (g5.xlarge) | ~400–500 | ~$1.00/hr | ~$0.0005 | GPU slightly better but higher hourly rate |
| Mid LLM (13B, INT8) | Xeon AMX (m7i.4xlarge) | ~50 | ~$0.80/hr | ~$0.003–0.004 | CPUs slow down, cost rises |
| Mid LLM (13B, INT8/BF16) | A100 (p4d.24xlarge) | ~1,500+ | ~$32/hr | ~$0.001–0.002 | GPUs more efficient at this scale |
| Large LLM (70B, BF16) | Xeon AMX (not practical) | <10 | ~$3/hr+ | ~$0.03–0.05 | Not cost-effective |
| Large LLM (70B, BF16) | H100 (p5.48xlarge) | ~3,000–4,000 | ~$98/hr | ~$0.002–0.003 | Best option for massive models |
(Numbers are ballpark, based on AWS public pricing + reported throughput; adjust with your own benchmarks.)
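Cost per 1K tokens falls out directly from hourly price and sustained throughput, so the table is easy to re-derive with your own measurements. A minimal sketch, using illustrative figures from the table above (ballpark assumptions, not benchmarks):

```python
def cost_per_1k_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1,000 generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1000

# Illustrative figures from the table above (assumed, not measured).
scenarios = {
    "3B INT8 on m7i.xlarge (CPU)":   (0.20, 90),
    "3B INT8 on g5.xlarge (A10)":    (1.00, 450),
    "13B INT8 on m7i.4xlarge (CPU)": (0.80, 50),
}

for name, (price, tps) in scenarios.items():
    print(f"{name}: ${cost_per_1k_tokens(price, tps):.4f} per 1K tokens")
```

Swap in your own benchmarked tokens/s and negotiated instance pricing to get numbers you can actually act on.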
3. When CPUs Are More Cost-Effective
- Quantized small/mid models (≤13B)
- Bursty/low-QPS traffic (agents, RAG, edge workloads)
- Tokenization, embedding, pre/post-processing tasks
- Commodity cloud or on-prem deployments where GPUs are scarce/overpriced
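The bursty-traffic case can be made concrete: below a certain demand level you pay for a whole instance either way, so the cheaper box wins regardless of per-token efficiency. A minimal sketch, assuming the illustrative m7i.xlarge/g5.xlarge figures from the table above:

```python
import math

def hourly_cost_for_demand(hourly_usd: float, capacity_tps: float,
                           demand_tps: float) -> float:
    """Hourly bill to serve a fixed token demand: instances are billed
    whole, so low demand still pays for at least one full instance."""
    instances = max(1, math.ceil(demand_tps / capacity_tps))
    return instances * hourly_usd

# ~20 tokens/s of spiky agent traffic (hypothetical demand level).
cpu_bill = hourly_cost_for_demand(0.20, 90, 20)   # -> 0.20 (one m7i.xlarge)
gpu_bill = hourly_cost_for_demand(1.00, 450, 20)  # -> 1.00 (one g5.xlarge, ~95% idle)
```

At this demand the CPU bill is 5x lower even though the GPU's fully-loaded cost per token is slightly better: the GPU's headroom goes unused but is still paid for.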
4. When GPUs Win on Cost
- Large models (≥30–70B) where CPU throughput collapses
- High-concurrency workloads where GPU batching drives cost per token down
- Training or fine-tuning (CPUs not viable)
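A quick way to see the crossover: a GPU is cheaper per token only when its throughput advantage exceeds its price premium. A sketch using the illustrative 13B figures from the table above (assumed, not measured):

```python
def break_even_tps(gpu_hourly: float, cpu_hourly: float,
                   cpu_tps: float) -> float:
    """Throughput the GPU must sustain to match the CPU's cost per token:
    the GPU's speedup must equal its hourly price premium."""
    return cpu_tps * gpu_hourly / cpu_hourly

# 13B example: p4d.24xlarge (~$32/hr) vs m7i.4xlarge (~$0.80/hr, ~50 tok/s)
print(break_even_tps(32.0, 0.80, 50))  # -> 2000.0 tokens/s
```

With continuous batching at high concurrency a multi-GPU instance clears that bar; at low concurrency it may not, which is exactly why routing matters in the takeaway below.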
5. Strategic Takeaway
Cost efficiency is workload-dependent.
- CPUs with AMX/AVX/SME are cheapest for smaller quantized models, agents, and spiky traffic.
- GPUs dominate cost per token for large, steady chat/research workloads.
- The most economical architecture is heterogeneous:
  - Route agents & utilities → CPUs
  - Route long-form inference → GPUs
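The heterogeneous routing rule above can be sketched as a few lines of dispatch logic. Pool names and thresholds here are illustrative assumptions, not recommendations; tune them against your own benchmarks:

```python
from dataclasses import dataclass

@dataclass
class Request:
    model_params_b: float   # model size in billions of parameters
    max_new_tokens: int     # requested generation length

def route(req: Request) -> str:
    """Send short calls on small/mid quantized models to CPUs;
    everything large or long-form to GPUs. Thresholds are assumptions."""
    if req.model_params_b <= 13 and req.max_new_tokens <= 256:
        return "cpu-pool"   # agents, utilities, spiky short-output traffic
    return "gpu-pool"       # large models or long-form generation

print(route(Request(3, 64)))     # -> cpu-pool
print(route(Request(70, 2048)))  # -> gpu-pool
```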