Agent vs Chat/Deep-Research Inference
Workload shapes (why they differ)
Agent inference (tool-use, function calls, short bursts)
Spiky, I/O-bound, lots of small decode segments
Frequent context updates (tool outputs), smaller active batch
Latency tolerance varies by step; overall “time-to-task” matters more than raw tokens/sec
Chat / deep-research inference (long context, long answers)
Sustained decoding, large prompts, bigger KV caches
Amenable to batching/throughput optimization
Sensitive to first-token latency and steady tokens/sec
CPU vs GPU fit by workload
| | Agent inference | Chat / deep-research inference |
| --- | --- | --- |
| Best silicon | CPU-first (AMX/AVX-512/SME) when models are ≤13B and quantized (INT8/4-bit); GPUs for larger models or vision/multimodal steps | GPU-first, especially for ≥13B models, long contexts, or high concurrency; CPU viable for ≤7–13B quantized at low concurrency |
| Batching gains | Low; steps are irregular, so GPUs underutilize unless you coalesce across users | High; steady streams boost GPU utilization massively |
| Perf/Watt | CPUs competitive for intermittent bursts (lower idle cost); SME2 compelling on-device | GPUs win for sustained decoding and high batch sizes |
| Latency to first token | CPU can be good enough if the model is small and warm; otherwise GPU leads | GPU typically leads (kernel fusion, HBM) |
| Memory pressure | Lower (short prompts, short outputs per step) | High (long prompts, KV-cache growth); favors GPUs with HBM, or CPUs with huge RAM but lower bandwidth |
| Ops complexity | Simple to scale horizontally with commodity CPUs | GPU scheduling, batching, and paged attention are more involved but pay off at scale |
Practical routing rules (drop into Router/NodePool; a combined sketch follows the list)
By model size & precision
≤7B, INT8/4-bit → CPU preferred (Xeon-AMX / EPYC-AVX512 / Arm-SME2 when available).
13B, INT8/4-bit → CPU for agent; GPU for chat (switch to GPU if the prompt exceeds 16–32k tokens or the latency SLA is strict).
≥33B or FP16/BF16 → GPU (both agent and chat), unless agent steps are rare and latency budget is loose.
By prompt/context
Context ≤16k tokens → CPUs remain viable for agent; GPUs for long replies.
Context >32k tokens or heavy RAG stitching → GPU (paged attention, KV offload efficiency).
By concurrency
QPS < 2 per model instance (bursty agents) → CPU wins on TCO.
QPS ≥ 5 with steady streams (chat, research) → GPU for utilization and joules/token.
By SLA
p95 step latency ≤150 ms (agent tool loop) → small models on CPU, or GPU if the model is >13B or multimodal.
First-token latency ≤75 ms and sustained ≥150 tokens/s per thread → GPU.
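Taken together, these rules reduce to a single routing function. A minimal Python sketch follows; the `Job` fields (`params_b`, `ctx_len`, `qps`, `p95_sla_ms`) and the returned pool names are illustrative, not an existing Router/NodePool API.

```python
from dataclasses import dataclass

@dataclass
class Job:
    kind: str          # "agent" | "chat" | "deep_research"
    params_b: float    # model size in billions of parameters
    quantized: bool    # True for INT8 / 4-bit weights
    ctx_len: int       # prompt + expected context, in tokens
    qps: float         # recent queries/sec for this model instance
    p95_sla_ms: float  # per-step latency SLA

def route(job: Job) -> str:
    """Combine the size/precision, context, concurrency, and SLA rules above."""
    # Hard GPU cases: big or full-precision models, very long contexts, steady load.
    if job.params_b >= 33 or not job.quantized:
        return "GPU"
    if job.ctx_len > 32_000 or job.qps >= 5:
        return "GPU"
    # Tight per-step SLA with a larger model also pushes to GPU.
    # (The first-token <=75 ms rule would need another SLA field; omitted here.)
    if job.p95_sla_ms <= 150 and job.params_b > 13:
        return "GPU"
    # Small/quantized models: agents stay on CPU; chat prefers GPU above ~7B
    # or when context/concurrency grows.
    if job.kind == "agent":
        return "CPU"
    if job.params_b <= 7 and job.ctx_len <= 16_000 and job.qps < 2:
        return "CPU"
    return "GPU"
```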
Optimization knobs per target
For CPU (agent-heavy)
Quantize to INT8 or 4-bit; enable AMX/VNNI/SME fast paths
Use speculative decoding (a 1–3B draft model on CPU), verifying on CPU or GPU as needed; a sketch follows this list
KV-cache paging to system RAM; smaller heads or grouped-query attention (GQA) where the model supports it
Fuse pre/post steps (tokenization, retrieval, tool adapters) on the same CPU host to avoid PCIe hops
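A toy sketch of the speculative-decoding loop from the bullet above, assuming `draft_next` and `target_next` are callables that return the next token ID under the draft and target models (placeholder names); real runtimes verify the k drafted tokens in one batched target pass rather than one call per position.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=64):
    """Greedy speculative decoding: the draft proposes k tokens cheaply,
    the target keeps the longest agreeing prefix plus one corrected token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Target verifies left to right.
        accepted = k
        for i in range(k):
            t = target_next(out + draft[:i])
            if t != draft[i]:
                accepted = i
                out.extend(draft[:i])
                out.append(t)      # the target's correction is still a "free" token
                break
        if accepted == k:
            out.extend(draft)      # every drafted token was accepted
    return out
```

On a CPU agent node both models can live in the same process; when a GPU is attached, only the verify calls need to move there.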
For GPU (chat/research)
Enable continuous batching & paged attention
Use FP8/BF16 where quality allows; enable tensor-parallel for ≥70B
Pin RAG pipelines close to GPU (GPU-resident embeddings, vector search cache, or at least NVMe cache)
Warm pools to hit <50 ms first-token latency (an example knob set follows)
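A hypothetical knob set for a GPU chat/research pool, written as a plain config dict; the keys map onto the bullets above rather than any specific serving framework.

```python
# Illustrative GPU pool configuration for chat / deep-research traffic.
GPU_CHAT_POOL = {
    "dtype": "bf16",               # drop to "fp8" where quality allows
    "continuous_batching": True,   # keep the decode batch full as streams come and go
    "paged_attention": True,       # page the KV cache to avoid fragmentation at long contexts
    "tensor_parallel": 4,          # split >=70B models across GPUs
    "warm_replicas": 2,            # weights stay resident to hit <50 ms first token
    "rag": {
        "embeddings": "gpu",       # keep embedding / rerank close to the GPU
        "vector_cache": "nvme",    # or GPU-resident if memory allows
    },
}
```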
Suggested Cortensor policies (ready-to-implement)
Policy: Workload-aware placement
```
if job.type == "agent":
    if model.params <= 13B and quantized: target = CPU(AMX|AVX512|SME2)
    else:                                 target = GPU
else if job.type in {"chat", "deep_research"}:
    if model.params <= 7B and quantized and ctx_len <= 16k and qps < 2: target = CPU
    else:                                                               target = GPU
```
Policy: ISA-aware CPU dispatch
```
CPU_AMX    -> prefer INT8/BF16 kernels (oneDNN/OpenVINO)
CPU_AVX512 -> prefer 4-bit/INT8 ggml/llama.cpp fast paths
CPU_SME2   -> ONNX Runtime + KleidiAI when available (Android/Arm nodes)
```
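One way to back this dispatch table is to read CPU feature flags when a node registers. A Linux-only sketch, assuming the usual `/proc/cpuinfo` flag names (`amx_tile`, `avx512f`, `sme2`), which are kernel-version dependent; `CPU_GENERIC` is an added fallback class.

```python
def detect_cpu_class(cpuinfo_path="/proc/cpuinfo"):
    """Map reported CPU feature flags to the dispatch classes above (Linux only)."""
    try:
        with open(cpuinfo_path) as f:
            text = f.read().lower()
    except OSError:
        return "CPU_GENERIC"
    # Collect the whitespace-separated tokens from the "flags" / "Features" lines.
    feats = set()
    for line in text.splitlines():
        if line.startswith(("flags", "features")):
            feats.update(line.split(":", 1)[-1].split())
    if "amx_tile" in feats:      # Intel AMX tiles (Sapphire Rapids and later)
        return "CPU_AMX"
    if "sme2" in feats:          # Arm SME2 (reported on the Arm "Features" line)
        return "CPU_SME2"
    if "avx512f" in feats:       # AVX-512 foundation (recent Xeon / EPYC)
        return "CPU_AVX512"
    return "CPU_GENERIC"
```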
Policy: Dynamic failover
If GPU queue depth exceeds its threshold or the batcher is starved, reassign small agent steps to CPU to preserve end-to-end task time.
If CPU p95 latency exceeds the SLA for two consecutive windows, promote the job class to GPU until the backlog clears (both rules sketched below).
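A sketch of both rules, assuming the scheduler supplies the GPU queue depth per decision and one CPU p95 sample per window; the names and thresholds (`gpu_queue_limit`, `patience`) are placeholders.

```python
class FailoverPolicy:
    """Spill small agent steps to CPU when the GPU side is congested, and
    promote a CPU job class to GPU after consecutive windows over SLA."""
    def __init__(self, sla_ms, gpu_queue_limit=32, patience=2):
        self.sla_ms, self.gpu_queue_limit, self.patience = sla_ms, gpu_queue_limit, patience
        self.bad_windows, self.promoted = 0, False

    def spill_agent_step_to_cpu(self, gpu_queue_depth, batcher_starved):
        # Rule 1: preserve end-to-end task time when the GPU queue is deep or starved.
        return gpu_queue_depth > self.gpu_queue_limit or batcher_starved

    def on_window(self, cpu_p95_ms, backlog):
        # Rule 2: promote after `patience` consecutive windows over SLA,
        # demote once the backlog clears (the hysteresis avoids thrash).
        if not self.promoted:
            self.bad_windows = self.bad_windows + 1 if cpu_p95_ms > self.sla_ms else 0
            self.promoted = self.bad_windows >= self.patience
        elif backlog == 0:
            self.promoted, self.bad_windows = False, 0
        return "GPU" if self.promoted else "CPU"
```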
Policy: Quantization tiers
Agent tier: 4-bit for tool calls & planners; 8-bit verifier/critic
Chat tier: 8-bit or BF16 for final generation on GPU; keep the reranker/embeddings on CPU if helpful (tier lookup sketched below)
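These tiers can be a simple lookup the router consults per component; the role names (`planner`, `verifier`, `generator`, `reranker`) are illustrative.

```python
# Illustrative quantization-tier lookup: (job tier, component role) -> precision.
QUANT_TIERS = {
    ("agent", "planner"):   "int4",   # tool calls and planning steps
    ("agent", "tool_call"): "int4",
    ("agent", "verifier"):  "int8",   # critic / verifier gets more headroom
    ("chat",  "generator"): "bf16",   # or "int8" on GPU for final generation
    ("chat",  "reranker"):  "int8",   # reranker / embeddings can stay on CPU
}

def precision_for(tier: str, role: str) -> str:
    return QUANT_TIERS.get((tier, role), "int8")
```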
Example mappings
Agentic web-tool bot (7B, 8k ctx, bursts) → CPU-AMX/AVX512, INT8, speculative decoding on 2–4 threads; promote rare long answers to GPU.
Analyst chat (13B, 32k ctx, steady traffic) → GPU BF16/FP8 with continuous batching; CPU handles retrieval & post-proc.
On-device assistant (3–7B, mobile/edge) → Arm-SME2 (as available) with 4-bit; offload long tasks to cloud GPU.
How to measure (routing signals)
Tokens/s, first-token latency (ms), KV-cache MB per token, context length, QPS, and a burstiness factor (p95 vs p50 interarrival time; computed in the sketch below)
Apply promote/demote rules on rolling windows (e.g., 60–120 s) with hysteresis to avoid thrashing
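A sketch of the burstiness signal, assuming the router keeps recent request arrival timestamps (in seconds) per model instance; the p95/p50 interarrival ratio is one way to quantify the burstiness factor above.

```python
from statistics import quantiles

def burstiness_factor(arrival_times_s):
    """p95/p50 of request interarrival times: ~1 for steady streams, large for bursty agents."""
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    if len(gaps) < 2:
        return 1.0
    cuts = quantiles(gaps, n=100)     # cuts[49] ~ p50, cuts[94] ~ p95
    p50, p95 = cuts[49], cuts[94]
    return p95 / p50 if p50 > 0 else float("inf")
```

A ratio near 1 indicates steady streams that batch well on GPU; a large ratio flags bursty agent traffic that CPU nodes absorb more cheaply.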
Bottom line
Agent workloads: favor CPUs (AMX/AVX-512/SME2) for small/quantized models and spiky demand; they minimize idle cost and keep “time-to-task” low.
Chat/deep-research: favor GPUs for long contexts and steady decoding; batching + HBM dominate cost/perf.
A heterogeneous policy that auto-routes by model size, precision, context, QPS, and SLA will beat either CPU-only or GPU-only strategies on both cost and user experience.