# TCO Impact

### 1. Hardware Acquisition (CapEx)

* **GPUs:**
  * **High upfront cost**: NVIDIA H100 \~$25k–$35k per card; AMD MI300X similar.
  * **Supply-constrained**, long lead times, often bundled with premium OEM systems.
* **CPUs:**
  * Commodity pricing, already included in existing servers.
  * AMX (Intel Sapphire Rapids) and AVX-512 (Sapphire Rapids, AMD EPYC Zen 4) come *free* if you already run those platforms; a quick capability check is sketched below.
  * Arm Graviton instances in the cloud are cheaper per vCPU than GPU nodes.

👉 **Impact:** GPUs raise CapEx dramatically; CPUs leverage sunk infrastructure.
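
One way to confirm that an existing fleet already exposes these instruction sets is to read the CPU feature flags the Linux kernel reports. A minimal sketch (Linux-only; flag names as they appear in `/proc/cpuinfo`):

```python
# Report which inference-relevant instruction-set extensions this CPU exposes,
# based on the feature flags in /proc/cpuinfo (Linux only).
FLAGS_OF_INTEREST = {
    "avx512f": "AVX-512 foundation",
    "avx512_vnni": "AVX-512 VNNI (INT8 dot products)",
    "avx512_bf16": "AVX-512 BF16",
    "amx_tile": "AMX tile registers",
    "amx_int8": "AMX INT8 matmul",
    "amx_bf16": "AMX BF16 matmul",
}

cpu_flags: set[str] = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            break

for flag, description in FLAGS_OF_INTEREST.items():
    print(f"{description:35s} ({flag}): {'yes' if flag in cpu_flags else 'no'}")
```

If the AMX or AVX-512 flags are already present on servers you own, CPU inference capacity is effectively a sunk cost.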

***

### 2. Operational Costs (OpEx)

* **Power & Cooling:**
  * GPUs draw **350–700W per card** (8× H100 ≈ 5–6 kW for the GPUs alone; a full node draws closer to 10 kW).
  * CPUs typically **150–300W per socket** (dual-socket Xeon ≈ 400–500W total).
  * In datacenter terms, **energy per token** is lower for GPUs *at scale* (large batches) but higher for GPUs *at low utilization*; a back-of-the-envelope comparison appears at the end of this section.
* **Cloud Pricing:**
  * GPU instances (e.g., AWS p5.48xlarge, 8× H100) run roughly **$90–$98/hour** on-demand.
  * CPU instances (e.g., m7i.4xlarge with Xeon AMX) are **<$1/hour**.
  * For small/quantized models, CPU cost per million tokens can actually beat GPU.

👉 **Impact:** GPUs are more power-hungry and expensive hourly, but amortize better at high throughput. CPUs win when workloads are bursty or low-QPS.
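
The energy-per-token point can be made concrete with simple arithmetic: joules per token is power draw divided by throughput. The wattages and throughputs below are illustrative assumptions, not measurements, and the low-utilization case is deliberately simplified (it assumes the GPUs stay near peak power at small batch sizes):

```python
# Back-of-the-envelope energy per token: joules/token = watts / (tokens/sec).
# All wattages and throughputs are illustrative assumptions, not benchmarks.
def kwh_per_million_tokens(watts: float, tokens_per_sec: float) -> float:
    joules_per_token = watts / tokens_per_sec
    return joules_per_token * 1_000_000 / 3_600_000  # joules -> kWh

scenarios = {
    "8x H100, well batched": (5600, 3500),    # ~700 W/card, high aggregate throughput
    "8x H100, low utilization": (5600, 350),  # simplification: power stays near peak
    "2-socket Xeon, 13B INT8": (450, 50),
}

for name, (watts, tps) in scenarios.items():
    print(f"{name:26s}: {kwh_per_million_tokens(watts, tps):5.2f} kWh per 1M tokens")
```

Under these assumptions the well-batched GPU node is the most energy-efficient per token, while the same node at low utilization is the least efficient of the three, which is exactly the crossover the OpEx comparison hinges on.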

***

### 3. Utilization Factor

* **GPUs must be fully loaded** (batching, continuous streams) to justify TCO. Idle GPU capacity is wasted CapEx + OpEx.
* **CPUs scale elastically** — already deployed for general compute, so inference can “borrow” unused cycles.

👉 **Impact:** TCO for GPUs is highly sensitive to utilization; CPUs are more forgiving.
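
To see why utilization dominates GPU TCO, look at idle burn: dollars paid for capacity that serves no tokens. A minimal sketch using the illustrative on-demand hourly rates assumed elsewhere on this page:

```python
# Idle burn: dollars per day paid for capacity that serves no tokens.
# Hourly rates are the illustrative on-demand prices assumed on this page.
def daily_idle_burn(hourly_rate: float, utilization: float) -> float:
    return hourly_rate * (1.0 - utilization) * 24

for utilization in (0.9, 0.5, 0.2):
    gpu = daily_idle_burn(98.00, utilization)  # 8x H100 node (assumed $98/hr)
    cpu = daily_idle_burn(0.80, utilization)   # Xeon AMX instance (assumed $0.80/hr)
    print(f"{utilization:.0%} utilized -> idle burn/day: GPU ${gpu:8.2f}, CPU ${cpu:5.2f}")
```

At 50% utilization the GPU node burns over $1,000 a day in unused capacity; the CPU instance burns under $10, and a CPU fleet provisioned for general compute can often absorb that within existing budgets.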

***

### 4. Engineering / Software Overhead

* **GPUs:**
  * Require specialized kernels (CUDA, ROCm, Triton, paged attention).
  * Higher engineering investment in serving infra (batch schedulers, KV-cache offload).
* **CPUs:**
  * Leverage mainstream stacks (PyTorch + oneDNN, OpenVINO, ONNX Runtime).
  * Easier to integrate into existing server workflows.

👉 **Impact:** CPU inference reduces dev/ops overhead → lower “hidden TCO.”
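
As an example of how little plumbing the CPU path needs, the sketch below uses plain PyTorch, which dispatches CPU matmuls to oneDNN (and to AMX on Sapphire Rapids when available), to confirm the backend is active and to probe BF16 matmul throughput. It assumes a reasonably recent PyTorch build with BF16 CPU support; the matrix size and iteration count are arbitrary:

```python
# Confirm the mainstream PyTorch/oneDNN CPU path is active and probe BF16 matmul
# throughput. This is a rough probe only, not a proxy for end-to-end LLM
# inference throughput.
import time
import torch

print("oneDNN (mkldnn) backend available:", torch.backends.mkldnn.is_available())
print("PyTorch CPU threads:", torch.get_num_threads())

n = 2048
a = torch.randn(n, n).to(torch.bfloat16)
b = torch.randn(n, n).to(torch.bfloat16)

a @ b  # warm-up
iters = 20
start = time.perf_counter()
for _ in range(iters):
    a @ b
elapsed = time.perf_counter() - start

gflops = 2 * n**3 * iters / elapsed / 1e9
print(f"~{gflops:.0f} GFLOP/s BF16 matmul on CPU")
```

No custom kernels, batch schedulers, or KV-cache offload machinery are required to get started; the same stack already runs the rest of the server fleet.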

***

### 5. Cost per Token (Illustrative)

| Workload | Hardware               | Tokens/sec  | $/hour (AWS est.) | $ per 1M tokens    |
| -------- | ---------------------- | ----------- | ----------------- | ------------------ |
| 3B INT8  | Xeon AMX (m7i.xlarge)  | \~80–100    | $0.20             | \~$0.6–0.7         |
| 3B INT8  | A10 GPU (g5.xlarge)    | \~400–500   | $1.00             | \~$0.6–0.7         |
| 13B INT8 | Xeon AMX (m7i.4xlarge) | \~50        | $0.80             | \~$4–5             |
| 13B BF16 | A100 (p4d.24xlarge)    | \~1,500+    | $32               | \~$5–6             |
| 70B BF16 | Xeon AMX               | <10         | $3+               | $80+ (impractical) |
| 70B BF16 | H100 (p5.48xlarge)     | 3,000–4,000 | $98               | \~$7–9             |

👉 **Impact:**

* For **small/quantized models**, CPUs are close or better in cost/token.
* For **large models**, GPUs dominate cost/token by an order of magnitude.
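
The last column is derived directly from the other two (hourly rate divided by tokens served per hour); a quick sketch to reproduce the estimates, using rough midpoints of the throughput ranges above:

```python
# $ per 1M tokens = hourly rate / (tokens/sec * 3600) * 1,000,000
# Throughputs are rough midpoints of the illustrative ranges in the table above.
rows = [
    ("3B INT8,  Xeon AMX (m7i.xlarge)",   0.20,   90),
    ("3B INT8,  A10 (g5.xlarge)",         1.00,  450),
    ("13B INT8, Xeon AMX (m7i.4xlarge)",  0.80,   50),
    ("13B BF16, A100 (p4d.24xlarge)",    32.00, 1500),
    ("70B BF16, H100 (p5.48xlarge)",     98.00, 3500),
]

for name, hourly_rate, tokens_per_sec in rows:
    per_million = hourly_rate / (tokens_per_sec * 3600) * 1_000_000
    print(f"{name:34s} ~${per_million:5.2f} per 1M tokens")
```

Swap in your own measured throughput and negotiated pricing to get figures for your environment; the ones here are illustrative only.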

***

### 6. Strategic TCO Takeaway

* **GPU-only strategy**: High CapEx/OpEx, but best for large models and high concurrency. Risks: overprovisioning, idle burn.
* **CPU-only strategy**: Cheap and abundant, but throughput collapses beyond 13–20B models.
* **Hybrid strategy** (best TCO):
  * Route **agents, RAG, 3–13B quantized models** → CPUs (AMX/AVX-512/SME2).
  * Route **chat, deep research, ≥30B models** → GPUs.
  * This maximizes utilization of expensive GPUs while lowering idle cost and power draw; a toy routing sketch follows below.
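
A toy router, to make the policy concrete. The thresholds (30B, 13B, 5 QPS, 50 ms) are assumptions for illustration only; in practice they should be tuned against measured throughput, latency, and cost on your own hardware:

```python
# Toy router for the hybrid strategy: send work to CPU or GPU pools based on
# model size, expected concurrency, and latency SLA. All thresholds are
# illustrative assumptions, not recommendations.
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    model_params_b: float   # model size in billions of parameters
    quantized: bool         # INT8/INT4 vs BF16/FP16
    expected_qps: float     # sustained concurrency for this workload
    latency_slo_ms: float   # per-token latency target

def route(req: InferenceRequest) -> str:
    # Large models overwhelm CPU memory bandwidth -> GPU regardless of load.
    if req.model_params_b >= 30:
        return "gpu-pool"
    # Small/quantized, bursty or low-QPS work (agents, RAG) -> CPU.
    if req.quantized and req.model_params_b <= 13 and req.expected_qps < 5:
        return "cpu-pool"
    # Tight latency targets or high sustained concurrency -> GPU for batching.
    if req.latency_slo_ms < 50 or req.expected_qps >= 5:
        return "gpu-pool"
    return "cpu-pool"

print(route(InferenceRequest(7, True, 1.0, 200)))    # agent/RAG helper -> cpu-pool
print(route(InferenceRequest(70, False, 20.0, 40)))  # chat frontend    -> gpu-pool
```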

***

📌 **Bottom line:**

* GPUs maximize performance, but TCO is only optimal if you keep them fully utilized.
* CPUs reduce TCO for smaller models, bursty traffic, and quantized workloads.
* The most cost-effective infrastructure is **heterogeneous**, dynamically routing workloads by **model size, concurrency, and SLA**.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.cortensor.network/technical-architecture/ai-inference/cpu-instruction-sets-for-llm-inference-avx-amx-sme-vs-gpus/tco-impact.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
