# Agent vs Chat/Deep-Research Inference

### Workload shapes (why they differ)

* **Agent inference** (tool-use, function calls, short bursts)
  * Spiky, I/O-bound, lots of small decode segments
  * Frequent context updates (tool outputs), smaller active batch
  * Latency tolerance varies by step; overall “time-to-task” matters more than raw tokens/sec
* **Chat / deep-research inference** (long context, long answers)
  * Sustained decoding, large prompts, bigger KV caches
  * Amenable to batching/throughput optimization
  * Sensitive to first-token latency and steady tokens/sec

### CPU vs GPU fit by workload

| Dimension                  | Agent (tools, routing, short bursts)                                                                                          | Chat / Deep-research (long ctx, long outputs)                                                                  |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| **Best silicon**           | **CPU-first** (AMX/AVX-512/SME) when models ≤13B and quantized (INT8/4-bit); GPUs for larger models or vision/multimodal steps | **GPU-first**, esp. ≥13B, long contexts, or high concurrency; CPU viable for ≤7–13B quantized, low concurrency |
| **Batching gains**         | Low; steps are irregular → GPUs underutilize unless you coalesce across users                                                 | High; steady streams boost GPU utilization massively                                                           |
| **Perf/Watt**              | CPUs competitive for intermittent bursts (lower idle cost); SME2 compelling on-device                                         | GPUs win for sustained decoding and high batch                                                                 |
| **Latency to first token** | CPU can be good enough if model small + warm; otherwise GPU leads                                                             | GPU typically leads (kernel fusion, HBM)                                                                       |
| **Memory pressure**        | Lower (short prompts, short outputs per step)                                                                                 | High (long prompts, KV cache growth); favors GPUs with HBM, or CPUs with large RAM at lower memory bandwidth   |
| **Ops complexity**         | Simple to scale horizontally with commodity CPUs                                                                              | GPU scheduling, batching, paged attention more involved but pays off at scale                                  |
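
The memory-pressure row can be made concrete with a back-of-envelope KV-cache estimate. The sketch below uses the usual per-token formula (keys plus values, across all layers and KV heads). The model dimensions are illustrative assumptions for a 7B-class transformer without grouped-query attention, not measurements from any Cortensor node.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(ctx_len: int, **dims) -> float:
    """Total KV cache for one sequence of ctx_len tokens, in GiB."""
    return ctx_len * kv_cache_bytes_per_token(**dims) / 2**30


# Assumed, illustrative dimensions for a 7B-class model with FP16 KV entries.
dims = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)

per_token_mib = kv_cache_bytes_per_token(**dims) / 2**20
agent_step = kv_cache_gib(2_048, **dims)    # short agent step
long_chat = kv_cache_gib(32_768, **dims)    # long-context chat session
print(f"{per_token_mib:.2f} MiB/token, {agent_step:.1f} GiB vs {long_chat:.1f} GiB")
# prints: 0.50 MiB/token, 1.0 GiB vs 16.0 GiB
```

Under these assumptions a single 32k-token session needs about 16x the KV memory of a 2k-token agent step, before any batching.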

### Practical routing rules (drop into Router/NodePool)

**By model size & precision**

* ≤7B, INT8/4-bit → **CPU preferred** (Xeon-AMX / EPYC-AVX512 / Arm-SME2 when available).
* 13B, INT8/4-bit → **CPU for agent; GPU for chat** (move agent steps to GPU if the prompt exceeds 16–32k tokens or the latency SLA is strict).
* ≥33B or FP16/BF16 → **GPU** (both agent and chat), unless agent steps are rare and latency budget is loose.
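
As a rough illustration, the size and precision rules above could be encoded as a single lookup. The function name and arguments are hypothetical. Sizes between the listed tiers (for example 8B, or 13B–33B) are not covered by the rules, so the sketch assigns them to the next tier up as an assumption, and it ignores the "rare agent steps, loose latency budget" exception.

```python
def target_by_model(params_b: float, quantized: bool, job_type: str) -> str:
    """Return "CPU" or "GPU" from model size (billions of parameters) and precision."""
    if not quantized or params_b >= 33:
        return "GPU"                      # FP16/BF16 weights or large models
    if params_b <= 7:
        return "CPU"                      # small INT8/4-bit models
    if params_b <= 13:
        # 13B class: CPU for agent steps, GPU for chat
        return "CPU" if job_type == "agent" else "GPU"
    return "GPU"                          # 13B-33B gap: not covered above, GPU assumed


print(target_by_model(7, True, "chat"))    # CPU
print(target_by_model(13, True, "agent"))  # CPU
print(target_by_model(13, True, "chat"))   # GPU
```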

**By prompt/context**

* Context ≤16k tokens → CPUs remain viable for agent steps; use GPUs for long replies.
* Context >32k tokens or heavy RAG stitching → **GPU** (paged attention, KV offload efficiency).

**By concurrency**

* QPS < 2 per model instance (bursty agents) → **CPU** wins on TCO.
* QPS ≥ 5 with steady streams (chat, research) → **GPU** for utilization and joules/token.
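
The context and concurrency thresholds above each leave a middle band undecided (16k–32k tokens, 2–5 QPS). One way to handle that, sketched below, is to let each signal cast a vote and resolve disagreements explicitly. The helper names, the reading of "16k" as 16,384 tokens, and the "GPU wins a disagreement" tie-break are all assumptions made for illustration.

```python
CTX_CPU_MAX = 16 * 1024   # "16k" read as 16,384 tokens (assumption)
CTX_GPU_MIN = 32 * 1024   # "32k" read as 32,768 tokens (assumption)


def vote_by_context(ctx_len: int) -> str:
    if ctx_len <= CTX_CPU_MAX:
        return "CPU"
    if ctx_len > CTX_GPU_MIN:
        return "GPU"
    return "EITHER"       # 16k-32k band is left open by the rules


def vote_by_qps(qps: float) -> str:
    if qps < 2:
        return "CPU"
    if qps >= 5:
        return "GPU"
    return "EITHER"       # 2-5 QPS band is left open by the rules


def combine(*votes: str) -> str:
    """Assumed tie-break: any GPU vote wins; CPU only if nothing asks for GPU."""
    if "GPU" in votes:
        return "GPU"
    if "CPU" in votes:
        return "CPU"
    return "EITHER"


print(combine(vote_by_context(8_000), vote_by_qps(1.5)))   # CPU
print(combine(vote_by_context(8_000), vote_by_qps(6.0)))   # GPU
```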

**By SLA**

* p95 step latency ≤150 ms (agent tool loop) → small models on **CPU**; **GPU** if the model is >13B or multimodal.
* First-token latency ≤75 ms and sustained ≥150 t/s per stream → **GPU**.
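
Checking these SLA rules requires a percentile over a window of latency samples. The sketch below uses the standard-library `statistics.quantiles`. The function names are hypothetical, and the thresholds are simply the ones quoted above.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile of a window of samples (needs at least two samples)."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def meets_agent_sla(step_latencies_ms: list[float], limit_ms: float = 150.0) -> bool:
    """True if the window's p95 step latency is within the agent tool-loop SLA."""
    return p95(step_latencies_ms) <= limit_ms


def sla_requires_gpu(first_token_ms: float, tokens_per_s: float) -> bool:
    """True if the requested SLA falls in the GPU band (<=75 ms, >=150 t/s)."""
    return first_token_ms <= 75 and tokens_per_s >= 150


window = [100.0] * 99 + [1000.0]          # one slow outlier in 100 steps
print(meets_agent_sla(window))            # True: p95 is still 100 ms
print(sla_requires_gpu(60, 200))          # True
```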

### Optimization knobs per target

**For CPU (agent-heavy)**

* Quantize to INT8 or 4-bit; enable AMX/VNNI/SME fast paths
* Use speculative decoding: draft with a 1–3B model on CPU, then **verify with the target model on CPU or GPU**
* Page the KV cache to system RAM; prefer models with grouped-query attention (GQA) to shrink the KV cache, where available
* Fuse pre/post steps (tokenization, retrieval, tool adapters) on the same CPU host to avoid PCIe hops
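
To make the speculative-decoding knob above concrete, here is a toy greedy variant: a cheap draft function proposes `k` tokens, and the target function keeps the agreeing prefix and supplies the next token itself. The integer "models" are stand-ins invented for this sketch. A real engine would score all `k` positions in one batched forward pass rather than calling the target once per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]   # greedy next-token function


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int = 4) -> List[int]:
    """One round: returns the tokens accepted after `prefix`."""
    ctx, proposed = list(prefix), []
    for _ in range(k):                   # draft proposes k tokens autoregressively
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    ctx, accepted = list(prefix), []
    for tok in proposed:                 # target verifies each proposal in order
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)    # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))         # bonus token when all proposals match
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             n_new: int, k: int = 4) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + n_new]


# Toy stand-ins: the target counts upward mod 10; the draft is wrong after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(generate([0], draft, target, n_new=8))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The output is identical to what the target alone would produce greedily; the draft only changes how many target calls are needed per accepted token.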

**For GPU (chat/research)**

* Enable continuous batching & paged attention
* Use FP8/BF16 where quality allows; enable tensor parallelism for ≥70B models
* Pin RAG pipelines close to GPU (GPU-resident embeddings, vector search cache, or at least NVMe cache)
* Pre-warm instance pools to hit <50 ms first-token latency
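
The benefit of continuous batching can be seen in a small simulation. The sketch below is a toy model, not any serving engine's scheduler: each active sequence decodes one token per step, and a finished sequence frees its slot immediately so a queued request can join. Request lengths and batch size are made-up inputs.

```python
from collections import deque


def continuous_batching(request_lengths: list[int], max_batch: int) -> dict[int, int]:
    """Return {request id: decode step at which it finished}."""
    queue = deque(enumerate(request_lengths))   # (id, tokens still to decode)
    active: dict[int, int] = {}
    finished_at: dict[int, int] = {}
    step = 0
    while queue or active:
        # Admit waiting requests into free slots before every step.
        while queue and len(active) < max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        step += 1
        for rid in list(active):
            active[rid] -= 1                    # one token per sequence per step
            if active[rid] <= 0:
                finished_at[rid] = step
                del active[rid]                 # slot frees immediately
    return finished_at


def static_batching_steps(request_lengths: list[int], max_batch: int) -> int:
    """Baseline: each batch runs until its longest request finishes."""
    return sum(max(request_lengths[i:i + max_batch])
               for i in range(0, len(request_lengths), max_batch))


lengths = [2, 5, 3]
print(continuous_batching(lengths, max_batch=2))    # {0: 2, 1: 5, 2: 5}
print(static_batching_steps(lengths, max_batch=2))  # 8
```

In this example the third request joins as soon as the first one finishes, so all work completes in 5 steps instead of 8.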

### Suggested Cortensor policies (ready-to-implement)

1. **Policy: Workload-aware placement**

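```python
def place(job_type: str, params_b: float, quantized: bool,
          ctx_len: int = 0, qps: float = 0.0) -> str:
    """Return "CPU" (AMX / AVX-512 / SME2 node) or "GPU" for a job."""
    if job_type == "agent":
        # Small quantized models stay on CPU; everything else goes to GPU.
        return "CPU" if params_b <= 13 and quantized else "GPU"
    if job_type in {"chat", "deep_research"}:
        # Chat stays on CPU only when model, context, and load are all small.
        small = params_b <= 7 and quantized and ctx_len <= 16_000 and qps < 2
        return "CPU" if small else "GPU"
    raise ValueError(f"unknown job type: {job_type}")
```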

2. **Policy: ISA-aware CPU dispatch**

```
CPU_AMX    -> prefer INT8/BF16 kernels (oneDNN/OpenVINO)
CPU_AVX512 -> prefer 4-bit/INT8 ggml/llama.cpp fast paths
CPU_SME2   -> ONNX Runtime + KleidiAI when available (Android/Arm nodes)
```
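
A node could pick its dispatch class from the CPU feature flags the OS reports. The sketch below assumes a Linux host and the flag strings commonly shown in `/proc/cpuinfo` (`amx_tile`, `avx512f`, `sme2`); verify these against the target kernel. The `CPU_GENERIC` fallback label and both function names are hypothetical.

```python
def classify_cpu(flags: set[str]) -> str:
    """Map CPU feature flags to one of the dispatch classes above."""
    if "amx_tile" in flags:
        return "CPU_AMX"        # AMX-capable x86 parts generally expose AVX-512 too
    if "avx512f" in flags:
        return "CPU_AVX512"
    if "sme2" in flags:
        return "CPU_SME2"
    return "CPU_GENERIC"        # hypothetical fallback class, not in the policy


def read_cpu_flags(path: str = "/proc/cpuinfo") -> set[str]:
    """Collect flags from the 'flags' (x86) or 'Features' (arm64) lines."""
    flags: set[str] = set()
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.strip().lower() in {"flags", "features"}:
                flags.update(value.split())
    return flags


print(classify_cpu({"avx512f", "amx_tile", "amx_int8"}))   # CPU_AMX
print(classify_cpu({"avx2", "avx512f"}))                   # CPU_AVX512
```

On a Linux node, `classify_cpu(read_cpu_flags())` would give the class for the local machine.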

3. **Policy: Dynamic failover**

* If GPU queue depth exceeds a threshold or the batcher is starved, **reassign small agent steps to CPU** to preserve end-to-end task time.
* If CPU p95 latency exceeds the SLA for two consecutive windows, **promote** the job class to GPU until the backlog clears.
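
The promotion rule above can be sketched as a small state machine with hysteresis. The class name and parameters are hypothetical. Since "backlog clears" is not defined precisely, the sketch approximates it as two consecutive windows back under the SLA, which is an assumption, and it presumes some CPU latency signal is still available after promotion.

```python
class FailoverController:
    """Promote a job class to GPU after repeated SLA breaches; demote when clear."""

    def __init__(self, sla_ms: float = 150.0, breach_windows: int = 2,
                 clear_windows: int = 2) -> None:
        self.sla_ms = sla_ms
        self.breach_windows = breach_windows
        self.clear_windows = clear_windows   # assumed stand-in for "backlog clears"
        self.breaches = 0
        self.clears = 0
        self.target = "CPU"

    def observe(self, cpu_p95_ms: float) -> str:
        """Feed one window's CPU p95 latency; returns the current target."""
        if cpu_p95_ms > self.sla_ms:
            self.breaches += 1
            self.clears = 0
        else:
            self.clears += 1
            self.breaches = 0
        if self.target == "CPU" and self.breaches >= self.breach_windows:
            self.target = "GPU"
        elif self.target == "GPU" and self.clears >= self.clear_windows:
            self.target = "CPU"
        return self.target


ctl = FailoverController()
print([ctl.observe(ms) for ms in [100, 200, 200, 100, 100]])
# ['CPU', 'CPU', 'GPU', 'GPU', 'CPU']
```

Requiring consecutive windows in both directions keeps a single slow or fast window from flipping the placement.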

4. **Policy: Quantization tiers**

* **Agent tier**: 4-bit for tool calls & planners; 8-bit verifier/critic
* **Chat tier**: 8-bit or BF16 for final generation on GPU; keep reranker/embedding on CPU if helpful

### Example mappings

* **Agentic web-tool bot (7B, 8k ctx, bursts)** → CPU-AMX/AVX512, INT8, speculative decoding on 2–4 threads; promote rare long answers to GPU.
* **Analyst chat (13B, 32k ctx, steady traffic)** → GPU BF16/FP8 with continuous batching; CPU handles retrieval & post-proc.
* **On-device assistant (3–7B, mobile/edge)** → Arm-SME2 (as available) with 4-bit; offload long tasks to cloud GPU.

### How to measure (routing signals)

* **Tokens/sec (t/s)**, **first-token latency (ms)**, **KV-cache MB/token**, **context length**, **QPS**, **burstiness factor** (p50 vs p95 inter-arrival time)
* Promote/demote rules on rolling windows (e.g., 60–120s) with hysteresis to avoid thrash
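
The burstiness signal can be computed from request arrival timestamps. The sketch below takes the factor to be the ratio of p95 to p50 inter-arrival gap, which is one reading of the definition above and should be treated as an assumption. The function name is hypothetical.

```python
import statistics


def burstiness_factor(arrival_times_s: list[float]) -> float:
    """p95 / p50 of inter-arrival gaps; about 1.0 for steady traffic, higher when bursty."""
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three arrivals in the window")
    cuts = statistics.quantiles(gaps, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    # Many simultaneous arrivals give a zero median gap: treat as maximally bursty.
    return float("inf") if p50 == 0 else p95 / p50


steady = [0, 1, 2, 3, 4, 5]                        # one request per second
bursty = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2]    # two bursts, long idle gap
print(burstiness_factor(steady))                   # 1.0
print(burstiness_factor(bursty) > 1.0)             # True
```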

***

#### Bottom line

* **Agent workloads**: favor **CPUs** (AMX/AVX-512/SME2) for small/quantized models and spiky demand; they minimize idle cost and keep “time-to-task” low.
* **Chat/deep-research**: favor **GPUs** for long contexts and steady decoding; batching + HBM dominate cost/perf.
* A **heterogeneous policy** that auto-routes by **model size, precision, context, QPS, and SLA** should outperform both CPU-only and GPU-only strategies on cost and user experience.


