Cost Analysis

1. Dimensions of Cost

  • CapEx (hardware acquisition):

    • GPUs (H100/MI300X): very high upfront cost, scarce supply, long lead times.

    • CPUs (Xeon/EPYC/Graviton): abundant, cheaper per socket, widely available across cloud and bare-metal.

  • OpEx (operational cost):

    • Power & cooling: GPUs draw 350–700W per card; CPUs typically draw 100–300W per socket.

    • Licensing & infra: GPU instances carry premium pricing; CPU instances are commodity-priced.

  • Developer/engineering cost:

    • GPUs need specialized software stacks (CUDA, ROCm, Triton) and serving techniques such as paged attention.

    • CPUs leverage mainstream libraries (PyTorch + oneDNN, ONNX Runtime, OpenVINO); see the sketch after this list.
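
As an illustration of how mainstream the CPU tooling is, here is a minimal sketch of INT8 CPU inference using stock PyTorch dynamic quantization. The checkpoint name is an illustrative stand-in for a small LLM; substitute your own model.

```python
# Minimal sketch: INT8 CPU inference with PyTorch dynamic quantization.
# Assumes torch and transformers are installed; the model choice is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # stand-in for a small quantizable LLM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Quantize the linear layers to INT8; on x86 this runs on PyTorch's
# quantized CPU backends (fbgemm/oneDNN).
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("CPU inference is cheapest when", return_tensors="pt")
with torch.no_grad():
    out = qmodel.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```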


2. Cost per Token (Illustrative)

| Workload | Hardware | Tokens/sec | Instance Price (cloud est.) | $ per 1K Tokens | Notes |
|---|---|---|---|---|---|
| Small LLM (3B, INT8) | Xeon AMX (m7i.xlarge) | ~80–100 | ~$0.20/hr | ~$0.0007 | Cheap; CPU competitive |
| Small LLM (3B, INT8) | A10 GPU (g5.xlarge) | ~400–500 | ~$1.00/hr | ~$0.0005 | GPU slightly better, but higher hourly rate |
| Mid LLM (13B, INT8) | Xeon AMX (m7i.4xlarge) | ~50 | ~$0.80/hr | ~$0.003–0.004 | CPUs slow down; cost rises |
| Mid LLM (13B, INT8/BF16) | A100 (p4d.24xlarge) | ~1,500+ | ~$32/hr | ~$0.001–0.002 | GPUs more efficient at this scale |
| Large LLM (70B, BF16) | Xeon AMX (not practical) | <10 | ~$3/hr+ | ~$0.03–0.05 | Not cost-effective |
| Large LLM (70B, BF16) | H100 (p5.48xlarge) | ~3,000–4,000 | ~$98/hr | ~$0.002–0.003 | Best option for massive models |

(Numbers are ballpark, based on AWS public on-demand pricing and reported throughput; $ per 1K tokens ≈ hourly price ÷ (tokens/sec × 3.6). Adjust with your own benchmarks; the sketch below reproduces the arithmetic.)
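
As a sanity check, the cost column can be recomputed from the other two columns. The figures below use midpoints of the throughput ranges above and land in the same ballpark as the table.

```python
# Minimal sketch: derive the cost column from hourly price and throughput.
# Row figures mirror the illustrative numbers in the table above.

def cost_per_1k_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """$ per 1K tokens = hourly price / (thousands of tokens generated per hour)."""
    return price_per_hour / (tokens_per_sec * 3600 / 1000)

rows = [
    ("3B INT8, m7i.xlarge",   0.20, 90),   # midpoint of ~80-100 tok/s
    ("3B INT8, g5.xlarge",    1.00, 450),  # midpoint of ~400-500 tok/s
    ("13B INT8, m7i.4xlarge", 0.80, 50),
]

for name, price, tps in rows:
    print(f"{name}: ~${cost_per_1k_tokens(price, tps):.4f} per 1K tokens")
```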


3. When CPUs Are More Cost-Effective

  • Quantized small/mid models (≤13B)

  • Bursty/low-QPS traffic (agents, RAG, edge workloads)

  • Tokenization, embedding, pre/post-processing tasks

  • Commodity cloud or on-prem deployments where GPUs are scarce/overpriced


4. When GPUs Win on Cost

  • Large models (roughly 30B and up) where CPU throughput collapses

  • High-concurrency workloads where GPU batching drives cost per token down (see the sketch after this list)

  • Training or fine-tuning (CPUs not viable)
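
The batching effect is simple arithmetic: the instance price is fixed per hour, so any aggregate throughput gained from batching divides the cost per token. The sketch below uses assumed (not measured) scaling numbers for a p4d-class instance.

```python
# Why batching favors GPUs: the hourly price is fixed, so extra aggregate
# throughput from batching divides the cost per token.
# The throughput scaling below is an assumed illustration, not a benchmark.

PRICE_PER_HOUR = 32.0  # e.g., a p4d-class instance

# Assumed aggregate tokens/sec as batch size grows; a GPU keeps scaling
# until compute/memory bandwidth saturates, while a CPU flattens far earlier.
assumed_gpu_tps = {1: 150, 8: 1_000, 32: 3_200, 128: 9_000}

for batch, tps in assumed_gpu_tps.items():
    dollars_per_1k = PRICE_PER_HOUR / (tps * 3.6)  # tps * 3.6 = K tokens/hr
    print(f"batch={batch:>3}: {tps:>5} tok/s -> ~${dollars_per_1k:.4f} per 1K tokens")
```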


5. Strategic Takeaway

  • Cost efficiency is workload-dependent.

    • CPUs with AMX/AVX/SME are cheapest for smaller quantized models, agents, and spiky traffic.

    • GPUs dominate cost-per-token for large, steady, chat/research workloads.

  • The most economical architecture is heterogeneous (a minimal routing sketch follows this list):

    • Route agents & utilities → CPUs

    • Route long-form inference → GPUs
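
A minimal sketch of that routing rule, assuming two serving pools behind hypothetical internal endpoints; the names, thresholds, and request shape are illustrative, not a prescribed API.

```python
# Hypothetical routing sketch for a heterogeneous fleet: endpoints,
# thresholds, and the Request shape are all illustrative assumptions.
from dataclasses import dataclass

CPU_POOL = "http://cpu-pool.internal/v1"  # hypothetical CPU serving endpoint
GPU_POOL = "http://gpu-pool.internal/v1"  # hypothetical GPU serving endpoint

@dataclass
class Request:
    model_params_b: float      # model size, billions of parameters
    max_new_tokens: int        # expected generation length
    is_agent_or_utility: bool  # tool calls, embeddings, pre/post-processing

def route(req: Request) -> str:
    # Small quantized models and short, bursty work are cheapest on CPUs.
    if req.is_agent_or_utility or (
        req.model_params_b <= 13 and req.max_new_tokens <= 256
    ):
        return CPU_POOL
    # Large models and long-form generation amortize better on GPUs.
    return GPU_POOL

print(route(Request(3, 64, True)))      # agent/utility call -> CPU pool
print(route(Request(70, 2048, False)))  # long-form 70B chat -> GPU pool
```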
