TCO Impact
1. Hardware Acquisition (CapEx)
GPUs:
High upfront cost: NVIDIA H100 ~$25k–$35k per card; AMD MI300X similar.
Supply-constrained, long lead times, often bundled with premium OEM systems.
CPUs:
Commodity pricing, already included in existing servers.
AMX (Sapphire Rapids) and AVX-512 (EPYC Zen 4) come at no extra cost on hardware you already run.
Arm Graviton instances in the cloud are cheaper per vCPU than GPU nodes.
👉 Impact: GPUs raise CapEx dramatically; CPUs leverage sunk infrastructure.
2. Operational Costs (OpEx)
Power & Cooling:
GPUs draw 350–700 W per card (an 8× H100 system ≈ 5–7 kW for the GPUs alone).
CPUs typically draw 150–300 W per socket (a dual-socket Xeon ≈ 400–500 W total).
In datacenter terms, energy per token is lower on GPUs at scale (large batches) but higher at low utilization; a rough comparison follows at the end of this section.
Cloud Pricing:
GPU instances (e.g., AWS p5.48xlarge with 8× H100) run $90–$98/hour.
CPU instances (e.g., m7i.4xlarge with Xeon AMX) cost under $1/hour.
For small/quantized models, CPU cost per million tokens can actually beat GPU (worked through in section 5).
👉 Impact: GPUs are more power-hungry and expensive per hour, but amortize better at high throughput. CPUs win when workloads are bursty or low-QPS.
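A rough way to see this trade-off: energy per token is just device power divided by throughput. Below is a minimal sketch; the wattages come from the bullets above, while the token rates are assumptions borrowed from the illustrative table in section 5, not measurements.

```python
# Energy per generated token = power (W = J/s) / throughput (tokens/s).
# All figures are illustrative assumptions, not benchmarks.

def joules_per_token(device_watts: float, tokens_per_sec: float) -> float:
    return device_watts / tokens_per_sec

# 8x H100 node, GPUs saturated with large batches (~700 W/card, ~3,500 tok/s assumed).
gpu_saturated = joules_per_token(8 * 700, 3_500)   # ~1.6 J/token
# Same node at low utilization: power stays high while throughput collapses.
gpu_low_util = joules_per_token(4_000, 50)         # ~80 J/token (assumed)
# Dual-socket Xeon (~450 W) running a small quantized model (~90 tok/s assumed).
cpu = joules_per_token(450, 90)                    # ~5 J/token

for name, j in [("GPU saturated", gpu_saturated),
                ("GPU low util ", gpu_low_util),
                ("CPU          ", cpu)]:
    print(f"{name}: {j:5.1f} J/token")
```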
3. Utilization Factor
GPUs must be kept fully loaded (batching, continuous streams) to justify their TCO. Idle GPU capacity is wasted CapEx + OpEx.
CPUs scale elastically: they are already deployed for general compute, so inference can "borrow" unused cycles.
👉 Impact: GPU TCO is highly sensitive to utilization; CPUs are more forgiving (illustrated below).
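To make that sensitivity concrete: effective cost per token divides the hourly price by tokens actually served, not peak capacity. A minimal sketch with assumed throughput figures (absolute numbers depend heavily on batching; the shape of the curve is the point):

```python
# Effective $ per 1M tokens = hourly price / tokens actually served per hour.
# Prices are on-demand list prices cited above; throughputs are assumptions.

def usd_per_million_tokens(price_per_hr: float,
                           peak_tok_per_s: float,
                           utilization: float) -> float:
    served_per_hr = peak_tok_per_s * utilization * 3600
    return price_per_hr / served_per_hr * 1_000_000

# H100 node (p5.48xlarge, ~$98/hr, ~3,500 tok/s peak assumed):
for util in (0.9, 0.5, 0.1):
    cost = usd_per_million_tokens(98.0, 3_500, util)
    print(f"GPU at {util:.0%} utilization: ${cost:6.2f} / 1M tokens")

# The absolute idle burn is the real pain: a parked GPU node costs
# 98 * 24 = $2,352/day, a parked CPU node (m7i.4xlarge) about $19/day.
```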
4. Engineering / Software Overhead
GPUs:
Require specialized kernels (CUDA, ROCm, Triton, paged attention).
Higher engineering investment in serving infra (batch schedulers, KV-cache offload).
CPUs:
Leverage mainstream stacks (PyTorch + oneDNN, OpenVINO, ONNX Runtime).
Easier to integrate into existing server workflows.
👉 Impact: CPU inference reduces dev/ops overhead, lowering "hidden TCO" (a minimal example follows).
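As one example of how little machinery the CPU path needs, here is a minimal sketch using stock PyTorch dynamic INT8 quantization on CPU (the oneDNN backend ships with PyTorch). The two-layer toy model is a stand-in for a real transformer, not a working LLM:

```python
# Minimal CPU inference sketch: dynamic INT8 quantization with stock PyTorch.
# The toy MLP stands in for a real model; no custom kernels or GPU runtime needed.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).eval()

# Convert Linear weights to INT8; activations are quantized on the fly.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.inference_mode():
    out = qmodel(torch.randn(1, 4096))
print(out.shape)  # torch.Size([1, 4096])
```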
5. Cost per Token (Illustrative)
| Model | Hardware | Throughput (tok/s) | Instance cost ($/hr) | Approx. cost / 1K tokens |
| --- | --- | --- | --- | --- |
| 3B INT8 | Xeon AMX (m7i.xlarge) | ~80–100 | $0.20 | ~$0.0007 |
| 3B INT8 | A10 GPU (g5.xlarge) | ~400–500 | $1.00 | ~$0.0005 |
| 13B INT8 | Xeon AMX (m7i.4xlarge) | ~50 | $0.80 | ~$0.003–0.004 |
| 13B BF16 | A100 (p4d.24xlarge) | ~1,500+ | $32 | ~$0.001–0.002 |
| 70B BF16 | Xeon AMX | <10 | $3+ | $0.03–0.05 (impractical) |
| 70B BF16 | H100 (p5.48xlarge) | 3,000–4,000 | $98 | ~$0.002–0.003 |
👉 Impact:
For small/quantized models, CPUs are close to or better than GPUs in cost per token (derivation sketched below).
For large models, GPUs dominate cost per token by an order of magnitude.
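The last column follows mechanically from hourly price and sustained throughput. A quick sanity check against the small-model rows (throughput midpoints assumed from the ranges above):

```python
# Cost per 1K tokens = hourly price / (tokens per hour) * 1,000.

def usd_per_1k_tokens(price_per_hr: float, tok_per_s: float) -> float:
    return price_per_hr / (tok_per_s * 3600) * 1_000

rows = [
    ("3B INT8, Xeon AMX (m7i.xlarge)",   0.20,  90),   # midpoint of 80-100 tok/s
    ("3B INT8, A10 GPU (g5.xlarge)",     1.00, 450),   # midpoint of 400-500 tok/s
    ("13B INT8, Xeon AMX (m7i.4xlarge)", 0.80,  50),
]
for name, price, tps in rows:
    print(f"{name}: ${usd_per_1k_tokens(price, tps):.4f} / 1K tokens")
# -> ~$0.0006, ~$0.0006, ~$0.0044 -- consistent with the table.
```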
6. Strategic TCO Takeaway
GPU-only strategy: High CapEx/OpEx, but best for large models and high concurrency. Risks: overprovisioning, idle burn.
CPU-only strategy: Cheap and abundant, but throughput collapses beyond 13–20B models.
Hybrid strategy (best TCO):
Route agents, RAG, and 3–13B quantized models → CPUs (AMX/AVX-512/SME2).
Route chat, deep research, and ≥30B models → GPUs.
This maximizes utilization of expensive GPUs while lowering idle cost and power draw (a routing sketch follows).
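What that routing policy might look like in code: a hypothetical sketch, with workload names and thresholds invented here purely to illustrate the decision logic (a production router would also weigh live queue depth and SLA):

```python
# Hypothetical hybrid router: all names and thresholds are illustrative.

def route(model_params_b: float, workload: str, quantized: bool) -> str:
    """Return the pool ('cpu' or 'gpu') a request should be dispatched to."""
    if workload in {"agent", "rag"} and quantized and model_params_b <= 13:
        return "cpu"   # AMX/AVX-512/SME2 path: cheap, elastic, borrows idle cycles
    if model_params_b >= 30 or workload in {"chat", "deep_research"}:
        return "gpu"   # big models / high concurrency need GPU throughput
    return "cpu" if quantized else "gpu"

print(route(7,  "rag",  quantized=True))    # -> cpu
print(route(70, "chat", quantized=False))   # -> gpu
```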
👉 Bottom line:
GPUs maximize performance, but TCO is only optimal if you keep them fully utilized.
CPUs reduce TCO for smaller models, bursty traffic, and quantized workloads.
The most cost-effective infrastructure is heterogeneous, dynamically routing workloads by model size, concurrency, and SLA.