TCO Impact
1. Hardware Acquisition (CapEx)
GPUs:
High upfront cost: NVIDIA H100 ~$25k–$35k per card; AMD MI300X similar.
Supply-constrained, long lead times, often bundled with premium OEM systems.
CPUs:
Commodity pricing, already included in existing servers.
AMX (Sapphire Rapids) or AVX-512 (EPYC Zen 4) acceleration comes at no extra cost if you already run those parts.
Arm Graviton instances in the cloud are cheaper per vCPU than GPU nodes.
👉 Impact: GPUs raise CapEx dramatically; CPUs leverage sunk infrastructure.
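To make the CapEx gap concrete, here is a back-of-the-envelope sketch; the $30k card price is the midpoint of the range above, and the 3-year depreciation window is an assumption, not an accounting rule.

```python
# Back-of-the-envelope: amortized hourly cost of an owned GPU card.
# $30k is the midpoint of the $25k-$35k range above; the 3-year
# depreciation window is an assumption.

def amortized_hourly(capex_usd: float, lifetime_years: float = 3.0) -> float:
    """Spread acquisition cost evenly over the hardware's useful life."""
    return capex_usd / (lifetime_years * 365 * 24)

per_card = amortized_hourly(30_000)
print(f"H100 CapEx alone: ${per_card:.2f}/hr per card")  # ~$1.14/hr
print(f"8-GPU node CapEx: ${8 * per_card:.2f}/hr")       # ~$9.13/hr
# A CPU server you already own adds $0/hr of *new* CapEx for inference.
```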
2. Operational Costs (OpEx)
Power & Cooling:
GPUs draw 350–700W per card (8x H100 system = 5–7 kW).
CPUs typically 150–300W per socket (dual-socket Xeon ≈ 400–500W total).
In datacenter terms, energy per token favors GPUs at scale (large batches) but favors CPUs when GPUs run at low utilization.
Cloud Pricing:
GPU instances (e.g., AWS p5.48xlarge, 8x H100) run $90–$98/hour.
CPU instances (m7i.4xlarge, Xeon with AMX) are <$1/hour.
For small/quantized models, CPU cost per million tokens can actually beat GPU.
👉 Impact: GPUs are more power-hungry and expensive hourly, but amortize better at high throughput. CPUs win when workloads are bursty or low-QPS.
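The energy-per-token claim above is easy to sanity-check. A minimal sketch, using the power ranges above and illustrative throughput figures (assumptions, not benchmarks):

```python
# Energy per token = sustained power draw (W = J/s) / throughput (tokens/s).
# Power figures come from the ranges above; throughputs are assumptions.

def joules_per_token(watts: float, tokens_per_s: float) -> float:
    return watts / tokens_per_s

print(joules_per_token(700, 3500))  # GPU, large batches:   ~0.2 J/token
print(joules_per_token(350, 50))    # GPU, low utilization:  ~7 J/token
print(joules_per_token(450, 90))    # dual-socket Xeon, 3B:  ~5 J/token
```

Fully batched, the GPU wins on energy per token by more than an order of magnitude; at low utilization the ordering flips.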
3. Utilization Factor
GPUs must be fully loaded (batching, continuous streams) to justify TCO. Idle GPU capacity is wasted CapEx + OpEx.
CPUs scale elastically: they are already deployed for general compute, so inference can “borrow” unused cycles.
👉 Impact: TCO for GPUs is highly sensitive to utilization; CPUs are more forgiving.
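A minimal sketch of that sensitivity, assuming the H100 node price and throughput used later in the table (illustrative figures, not benchmarks):

```python
# Effective cost per 1K tokens as a function of utilization, using the
# H100 node price and throughput from the table below (illustrative).

def cost_per_1k_tokens(hourly_usd: float, peak_tok_s: float, util: float) -> float:
    tokens_per_hour = peak_tok_s * 3600 * util
    return hourly_usd / tokens_per_hour * 1000

for util in (1.0, 0.5, 0.1):
    usd = cost_per_1k_tokens(98, 3500, util)
    print(f"H100 node at {util:.0%} utilization: ${usd:.4f} / 1K tokens")
# Cost per token scales inversely with utilization: the same GPU at 10%
# load costs 10x more per token than one kept busy around the clock.
```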
4. Engineering / Software Overhead
GPUs:
Require specialized kernels (CUDA, ROCm, Triton, paged attention).
Higher engineering investment in serving infra (batch schedulers, KV-cache offload).
CPUs:
Leverage mainstream stacks (PyTorch + oneDNN, OpenVINO, ONNX Runtime).
Easier to integrate into existing server workflows.
👉 Impact: CPU inference reduces dev/ops overhead → lower “hidden TCO.”
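As a sketch of how thin that software layer can be, the snippet below runs a causal LM on CPU through the stock PyTorch + Hugging Face path; the gpt2 model ID is a stand-in for any small model, and oneDNN handles AMX/AVX-512 dispatch behind the scenes.

```python
# Minimal CPU inference via the mainstream PyTorch + Hugging Face stack.
# PyTorch's CPU backend uses oneDNN, which dispatches to AMX/AVX-512
# automatically where the CPU supports them; no custom kernels needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in; swap in any 3B-class causal LM you can access

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("Total cost of ownership depends on", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```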
5. Cost per Token (Illustrative)
| Model | Hardware (instance) | Throughput (tok/s) | Instance price ($/hr) | ≈ Cost / 1K tokens |
| --- | --- | --- | --- | --- |
| 3B INT8 | Xeon AMX (m7i.xlarge) | ~80–100 | $0.20 | ~$0.0007 |
| 3B INT8 | A10 GPU (g5.xlarge) | ~400–500 | $1.00 | ~$0.0005 |
| 13B INT8 | Xeon AMX (m7i.4xlarge) | ~50 | $0.80 | ~$0.003–0.004 |
| 13B BF16 | A100 (p4d.24xlarge) | ~1,500+ | $32 | ~$0.001–0.002 |
| 70B BF16 | Xeon AMX | <10 | $3+ | $0.03–0.05 (impractical) |
| 70B BF16 | H100 (p5.48xlarge) | 3,000–4,000 | $98 | ~$0.002–0.003 |
👉 Impact:
For small/quantized models, CPU cost per token is close to, or better than, GPU.
For large models, GPUs dominate cost/token by an order of magnitude.
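For transparency, the last column follows mechanically from the two before it; a quick sketch using the table's own illustrative figures:

```python
# cost / 1K tokens = hourly price / (tokens per second * 3600) * 1000.
# Throughputs are the table's illustrative midpoints, not benchmarks.
rows = [
    ("3B INT8 on m7i.xlarge",   0.20,  90),
    ("3B INT8 on g5.xlarge",    1.00, 450),
    ("13B INT8 on m7i.4xlarge", 0.80,  50),
]
for name, usd_per_hr, tok_s in rows:
    usd_per_1k = usd_per_hr / (tok_s * 3600) * 1000
    print(f"{name}: ${usd_per_1k:.4f} per 1K tokens")
# Lands within rounding of the table's figures.
```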
6. Strategic TCO Takeaway
GPU-only strategy: High CapEx/OpEx, but best for large models and high concurrency. Risks: overprovisioning, idle burn.
CPU-only strategy: Cheap and abundant, but throughput collapses beyond roughly 13–20B parameters.
Hybrid strategy (best TCO):
Route agents, RAG, 3–13B quantized models → CPUs (AMX/AVX-512/SME2).
Route chat, deep research, ≥30B models → GPUs.
This maximizes utilization of expensive GPUs while lowering idle cost and power draw; a minimal routing sketch follows below.
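In the sketch below, the pool names, thresholds, and request fields are all hypothetical; a production router would also weigh queue depth and live utilization.

```python
# Minimal routing sketch for the hybrid strategy. Pool names, thresholds,
# and request fields are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Request:
    model_params_b: float  # model size, billions of parameters
    quantized: bool        # INT8/INT4 vs. BF16
    latency_slo_ms: int    # per-token latency target
    expected_qps: float    # sustained concurrency for this workload

def route(req: Request) -> str:
    # Large models or tight latency SLOs need GPU throughput.
    if req.model_params_b >= 30 or req.latency_slo_ms < 50:
        return "gpu-pool"
    # Small quantized models at modest QPS are cheapest on AMX/AVX-512 CPUs.
    if req.model_params_b <= 13 and req.quantized and req.expected_qps < 5:
        return "cpu-pool"
    # In between: send to GPU only if it will stay well utilized.
    return "gpu-pool" if req.expected_qps >= 5 else "cpu-pool"

print(route(Request(3, True, 200, 1.0)))    # agents/RAG -> cpu-pool
print(route(Request(70, False, 40, 20.0)))  # chat at scale -> gpu-pool
```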
📌 Bottom line:
GPUs maximize performance, but TCO is only optimal if you keep them fully utilized.
CPUs reduce TCO for smaller models, bursty traffic, and quantized workloads.
The most cost-effective infrastructure is heterogeneous, dynamically routing workloads by model size, concurrency, and SLA.