TCO Impact

1. Hardware Acquisition (CapEx)

  • GPUs:

    • High upfront cost: NVIDIA H100 ~$25k–$35k per card; AMD MI300X similar.

    • Supply-constrained, long lead times, often bundled with premium OEM systems.

  • CPUs:

    • Commodity pricing, already included in existing servers.

    • AMX (Sapphire Rapids Xeons) and AVX-512 (Sapphire Rapids, EPYC Zen 4) come at no extra cost if you already run that hardware.

    • Arm Graviton instances in cloud are cheaper per vCPU than GPU nodes.

👉 Impact: GPUs raise CapEx dramatically; CPUs leverage sunk infrastructure.


2. Operational Costs (OpEx)

  • Power & Cooling:

    • GPUs draw 350–700W per card (8x H100 system = 5–7 kW).

    • CPUs typically 150–300W per socket (dual-socket Xeon ≈ 400–500W total).

    • In datacenter terms, energy per token is lower on GPUs at scale (large batches) but higher when GPUs sit at low utilization.

  • Cloud Pricing:

    • GPU instances (e.g., AWS p5.48xlarge, 8× H100) run $90–$98/hour on demand.

    • CPU instances (e.g., m7i.4xlarge, 4th-gen Xeon with AMX) cost under $1/hour.

    • For small/quantized models, CPU cost per million tokens can actually beat GPU.

👉 Impact: GPUs are more power-hungry and expensive hourly, but amortize better at high throughput. CPUs win when workloads are bursty or low-QPS.
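The electricity side of these numbers is easy to sanity-check: wall draw × facility overhead (PUE) × utility rate. A minimal sketch, where the $/kWh rate and PUE are assumptions, not figures from this document (typical US datacenter rates run roughly $0.05–0.15/kWh, PUE 1.1–1.6):

```python
def energy_cost_per_hour(draw_kw: float, usd_per_kwh: float = 0.10, pue: float = 1.3) -> float:
    """Hourly facility electricity cost for a box drawing draw_kw at the wall.

    usd_per_kwh and pue are assumed defaults; PUE (power usage
    effectiveness) folds cooling and distribution losses into the bill.
    """
    return draw_kw * pue * usd_per_kwh

# Using the power figures above:
print(energy_cost_per_hour(6.0))   # 8x H100 node, ~6 kW
print(energy_cost_per_hour(0.45))  # dual-socket Xeon, ~450 W
```

Note that electricity is a small fraction of cloud hourly pricing but dominates OpEx for owned hardware run over multiple years.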


3. Utilization Factor

  • GPUs must be fully loaded (batching, continuous streams) to justify TCO. Idle GPU capacity is wasted CapEx + OpEx.

  • CPUs scale elastically — already deployed for general compute, so inference can “borrow” unused cycles.

👉 Impact: TCO for GPUs is highly sensitive to utilization; CPUs are more forgiving.
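This sensitivity is easy to quantify: the hourly bill accrues whether or not tokens are being served, so effective cost per token scales with 1/utilization. A small sketch using assumed figures for an 8× H100 node ($98/hour, ~3,500 tok/s peak):

```python
def effective_usd_per_million(usd_per_hour: float, peak_tps: float, utilization: float) -> float:
    """Effective $/1M tokens when the box serves traffic only a fraction
    of the time (utilization in (0, 1]); the hourly cost accrues regardless."""
    served_per_hour = peak_tps * 3600 * utilization
    return usd_per_hour / served_per_hour * 1_000_000

# Halving utilization doubles the effective cost per token.
for u in (1.0, 0.5, 0.1):
    print(u, round(effective_usd_per_million(98, 3500, u), 2))
```

At 10% utilization the same node costs 10× more per token than at full load, which is why batching and continuous streams are prerequisites for GPU economics.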


4. Engineering / Software Overhead

  • GPUs:

    • Require specialized kernels (CUDA, ROCm, Triton, paged attention).

    • Higher engineering investment in serving infra (batch schedulers, KV-cache offload).

  • CPUs:

    • Leverage mainstream stacks (PyTorch + oneDNN, OpenVINO, ONNX Runtime).

    • Easier to integrate into existing server workflows.

👉 Impact: CPU inference reduces dev/ops overhead → lower “hidden TCO.”


5. Cost per Token (Illustrative)

| Workload | Hardware | Tokens/sec | $/hour (AWS est.) | $ per 1K tokens (est.) |
| --- | --- | --- | --- | --- |
| 3B INT8 | Xeon AMX (m7i.xlarge) | ~80–100 | $0.20 | ~$0.0007 |
| 3B INT8 | A10 GPU (g5.xlarge) | ~400–500 | $1.00 | ~$0.0005 |
| 13B INT8 | Xeon AMX (m7i.4xlarge) | ~50 | $0.80 | ~$0.003–0.004 |
| 13B BF16 | A100 (p4d.24xlarge) | ~1,500+ | $32 | ~$0.001–0.002 |
| 70B BF16 | Xeon AMX | <10 | $3+ | $0.03–0.05 (impractical) |
| 70B BF16 | H100 (p5.48xlarge) | ~3,000–4,000 | $98 | ~$0.002–0.003 |

👉 Impact:

  • For small/quantized models, CPUs are close or better in cost/token.

  • For large models, GPUs dominate cost/token by an order of magnitude.
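The arithmetic behind figures like these is simply price per hour divided by tokens generated per hour, assuming the instance stays fully utilized. A minimal helper, expressed per million tokens (the instance prices are illustrative on-demand estimates, not quotes):

```python
def usd_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """Cost of generating 1M tokens at a sustained decode rate.

    Assumes full utilization; idle time inflates the real figure
    proportionally (see the utilization section above).
    """
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# Midpoint figures from the table above:
print(usd_per_million_tokens(0.20, 90))    # Xeon AMX m7i.xlarge, 3B INT8
print(usd_per_million_tokens(1.00, 450))   # A10 g5.xlarge, 3B INT8
print(usd_per_million_tokens(98.0, 3500))  # H100 p5.48xlarge, 70B BF16
```

The first two results land within a few percent of each other, which is the quantitative basis for the claim that CPUs are cost-competitive on small quantized models.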


6. Strategic TCO Takeaway

  • GPU-only strategy: High CapEx/OpEx, but best for large models and high concurrency. Risks: overprovisioning, idle burn.

  • CPU-only strategy: Cheap and abundant, but throughput collapses beyond 13–20B models.

  • Hybrid strategy (best TCO):

    • Route agents, RAG, 3–13B quantized models → CPUs (AMX/AVX-512/SME2).

    • Route chat, deep research, ≥30B models → GPUs.

    • This maximizes utilization of expensive GPUs, while lowering idle cost and power draw.
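The hybrid routing policy above can be sketched as a toy dispatcher. The thresholds (13B for quantized CPU serving, 30B for GPU-only) follow the bullets above but are illustrative, not tuned values:

```python
def route(model_params_b: float, quantized: bool, workload: str) -> str:
    """Toy router for the hybrid strategy: small quantized agent/RAG
    traffic goes to CPUs, large models and chat/research go to GPUs."""
    if workload in ("agent", "rag") and quantized and model_params_b <= 13:
        return "cpu"  # AMX/AVX-512/SME2 path
    if model_params_b >= 30:
        return "gpu"  # beyond CPU throughput limits
    return "gpu" if workload in ("chat", "research") else "cpu"

print(route(7, True, "rag"))     # cpu
print(route(70, False, "chat"))  # gpu
```

A production router would also weigh current GPU queue depth and latency SLAs, not just model size and workload class.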


📌 Bottom line:

  • GPUs maximize performance, but TCO is only optimal if you keep them fully utilized.

  • CPUs reduce TCO for smaller models, bursty traffic, and quantized workloads.

  • The most cost-effective infrastructure is heterogeneous, dynamically routing workloads by model size, concurrency, and SLA.
