Cost Analysis

1. Dimensions of Cost

  • CapEx (hardware acquisition):

    • GPUs (H100/MI300X): very high upfront cost, scarce supply, long lead times.

    • CPUs (Xeon/EPYC/Graviton): abundant, cheaper per socket, widely available across cloud and bare-metal.

  • OpEx (operational cost):

    • Power & cooling: GPUs draw 350–700W per card; CPUs typically draw 100–300W per socket.

    • Licensing & infra: GPU instances carry premium pricing; CPU instances are commodity-priced.

  • Developer/engineering cost:

    • GPUs need specialized software stacks (CUDA, ROCm, Triton) and serving techniques such as paged attention.

    • CPUs leverage mainstream libraries (PyTorch + oneDNN, ONNX Runtime, OpenVINO); see the sketch after this list.
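
As an illustration of how mainstream the CPU tooling is, here is a minimal sketch of INT8 CPU inference using stock PyTorch dynamic quantization. The checkpoint name is an illustrative stand-in for a small LLM; substitute your own model.

```python
# Minimal sketch: INT8 CPU inference with PyTorch dynamic quantization.
# Assumes torch and transformers are installed; the model choice is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # stand-in for a small quantizable LLM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Quantize the linear layers to INT8; on x86 this runs on PyTorch's
# quantized CPU backends (fbgemm/oneDNN).
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("CPU inference is cheapest when", return_tensors="pt")
with torch.no_grad():
    out = qmodel.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```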


2. Cost per Token (Illustrative)

| Workload | Hardware | Tokens/sec | Instance Price (cloud est.) | $ per 1K Tokens | Notes |
|---|---|---|---|---|---|
| Small LLM (3B, INT8) | Xeon AMX (m7i.xlarge) | ~80–100 | ~$0.20/hr | ~$0.0007 | Cheap; CPU competitive |
| Small LLM (3B, INT8) | A10 GPU (g5.xlarge) | ~400–500 | ~$1.00/hr | ~$0.0005 | GPU slightly better, but higher hourly rate |
| Mid LLM (13B, INT8) | Xeon AMX (m7i.4xlarge) | ~50 | ~$0.80/hr | ~$0.003–0.004 | CPUs slow down; cost rises |
| Mid LLM (13B, INT8/BF16) | A100 (p4d.24xlarge) | ~1,500+ | ~$32/hr | ~$0.001–0.002 | GPUs more efficient at this scale |
| Large LLM (70B, BF16) | Xeon AMX (not practical) | <10 | ~$3/hr+ | ~$0.03–0.05 | Not cost-effective |
| Large LLM (70B, BF16) | H100 (p5.48xlarge) | ~3,000–4,000 | ~$98/hr | ~$0.002–0.003 | Best option for massive models |

(Numbers are ballpark, based on AWS public on-demand pricing and reported throughput; $ per 1K tokens ≈ hourly price ÷ (tokens/sec × 3.6). Adjust with your own benchmarks; the sketch below reproduces the arithmetic.)
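
As a sanity check, the cost column can be recomputed from the other two columns. The figures below use midpoints of the throughput ranges above and land in the same ballpark as the table.

```python
# Minimal sketch: derive the cost column from hourly price and throughput.
# Row figures mirror the illustrative numbers in the table above.

def cost_per_1k_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """$ per 1K tokens = hourly price / (thousands of tokens generated per hour)."""
    return price_per_hour / (tokens_per_sec * 3600 / 1000)

rows = [
    ("3B INT8, m7i.xlarge",   0.20, 90),   # midpoint of ~80-100 tok/s
    ("3B INT8, g5.xlarge",    1.00, 450),  # midpoint of ~400-500 tok/s
    ("13B INT8, m7i.4xlarge", 0.80, 50),
]

for name, price, tps in rows:
    print(f"{name}: ~${cost_per_1k_tokens(price, tps):.4f} per 1K tokens")
```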


3. When CPUs Are More Cost-Effective

  • Quantized small/mid models (≤13B)

  • Bursty/low-QPS traffic (agents, RAG, edge workloads)

  • Tokenization, embedding, pre/post-processing tasks

  • Commodity cloud or on-prem deployments where GPUs are scarce/overpriced


4. When GPUs Win on Cost

  • Large models (roughly 30B and up) where CPU throughput collapses

  • High-concurrency workloads where GPU batching drives cost per token down (see the sketch after this list)

  • Training or fine-tuning (CPUs not viable)
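
The batching effect is simple arithmetic: the instance price is fixed per hour, so any aggregate throughput gained from batching divides the cost per token. The sketch below uses assumed (not measured) scaling numbers for a p4d-class instance.

```python
# Why batching favors GPUs: the hourly price is fixed, so extra aggregate
# throughput from batching divides the cost per token.
# The throughput scaling below is an assumed illustration, not a benchmark.

PRICE_PER_HOUR = 32.0  # e.g., a p4d-class instance

# Assumed aggregate tokens/sec as batch size grows; a GPU keeps scaling
# until compute/memory bandwidth saturates, while a CPU flattens far earlier.
assumed_gpu_tps = {1: 150, 8: 1_000, 32: 3_200, 128: 9_000}

for batch, tps in assumed_gpu_tps.items():
    dollars_per_1k = PRICE_PER_HOUR / (tps * 3.6)  # tps * 3.6 = K tokens/hr
    print(f"batch={batch:>3}: {tps:>5} tok/s -> ~${dollars_per_1k:.4f} per 1K tokens")
```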


5. Strategic Takeaway

  • Cost efficiency is workload-dependent.

    • CPUs with AMX/AVX/SME are cheapest for smaller quantized models, agents, and spiky traffic.

    • GPUs dominate cost-per-token for large, steady, chat/research workloads.

  • The most economical architecture is heterogeneous (a minimal routing sketch follows this list):

    • Route agents & utilities → CPUs

    • Route long-form inference → GPUs
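
A minimal sketch of that routing rule, assuming two serving pools behind hypothetical internal endpoints; the names, thresholds, and request shape are illustrative, not a prescribed API.

```python
# Hypothetical routing sketch for a heterogeneous fleet: endpoints,
# thresholds, and the Request shape are all illustrative assumptions.
from dataclasses import dataclass

CPU_POOL = "http://cpu-pool.internal/v1"  # hypothetical CPU serving endpoint
GPU_POOL = "http://gpu-pool.internal/v1"  # hypothetical GPU serving endpoint

@dataclass
class Request:
    model_params_b: float      # model size, billions of parameters
    max_new_tokens: int        # expected generation length
    is_agent_or_utility: bool  # tool calls, embeddings, pre/post-processing

def route(req: Request) -> str:
    # Small quantized models and short, bursty work are cheapest on CPUs.
    if req.is_agent_or_utility or (
        req.model_params_b <= 13 and req.max_new_tokens <= 256
    ):
        return CPU_POOL
    # Large models and long-form generation amortize better on GPUs.
    return GPU_POOL

print(route(Request(3, 64, True)))      # agent/utility call -> CPU pool
print(route(Request(70, 2048, False)))  # long-form 70B chat -> GPU pool
```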
