Agent vs Chat/Deep-Research Inference
Workload shapes (why they differ)
Agent inference (tool-use, function calls, short bursts)
Spiky, I/O-bound, lots of small decode segments
Frequent context updates (tool outputs), smaller active batch
Latency tolerance varies by step; overall “time-to-task” matters more than raw tokens/sec
Chat / deep-research inference (long context, long answers)
Sustained decoding, large prompts, bigger KV caches
Amenable to batching/throughput optimization
Sensitive to first-token latency and steady tokens/sec
CPU vs GPU fit by workload
| | Agent inference | Chat / deep-research inference |
| --- | --- | --- |
| Best silicon | CPU-first (AMX/AVX-512/SME) when models are ≤13B and quantized (INT8/4-bit); GPUs for larger models or vision/multimodal steps | GPU-first, especially for ≥13B models, long contexts, or high concurrency; CPU viable for ≤7–13B quantized at low concurrency |
| Batching gains | Low; steps are irregular, so GPUs underutilize unless you coalesce across users | High; steady streams boost GPU utilization massively |
| Perf/Watt | CPUs competitive for intermittent bursts (lower idle cost); SME2 compelling on-device | GPUs win for sustained decoding and high batch sizes |
| Latency to first token | CPU can be good enough if the model is small and warm; otherwise GPU leads | GPU typically leads (kernel fusion, HBM) |
| Memory pressure | Lower (short prompts, short outputs per step) | High (long prompts, KV-cache growth); favors GPUs with HBM, or CPUs with huge RAM but lower bandwidth |
| Ops complexity | Simple to scale horizontally with commodity CPUs | GPU scheduling, batching, and paged attention are more involved but pay off at scale |
Practical routing rules (drop into Router/NodePool; a combined sketch follows the list)
By model size & precision
≤7B, INT8/4-bit → CPU preferred (Xeon-AMX / EPYC-AVX512 / Arm-SME2 when available).
13B, INT8/4-bit → CPU for agent; GPU for chat (switch to GPU if the prompt exceeds 16–32k tokens or the latency SLA is strict).
≥33B or FP16/BF16 → GPU (both agent and chat), unless agent steps are rare and latency budget is loose.
By prompt/context
Context ≤16k tokens → CPUs remain viable for agent; GPUs for long replies.
Context >32k tokens or heavy RAG stitching → GPU (paged attention, KV offload efficiency).
By concurrency
QPS < 2 per model instance (bursty agents) → CPU wins on TCO.
QPS ≥ 5 with steady streams (chat, research) → GPU for utilization and joules/token.
By SLA
p95 step latency ≤150 ms (agent tool loop) → small models on CPU, or GPU if the model is >13B or multimodal.
First-token latency ≤75 ms and sustained ≥150 tokens/s per thread → GPU.
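Taken together, these rules reduce to a single routing function. A minimal Python sketch follows; the `Job` fields (`params_b`, `ctx_len`, `qps`, `p95_sla_ms`) and the returned pool names are illustrative, not an existing Router/NodePool API.

```python
from dataclasses import dataclass

@dataclass
class Job:
    kind: str          # "agent" | "chat" | "deep_research"
    params_b: float    # model size in billions of parameters
    quantized: bool    # True for INT8 / 4-bit weights
    ctx_len: int       # prompt + expected context, in tokens
    qps: float         # recent queries/sec for this model instance
    p95_sla_ms: float  # per-step latency SLA

def route(job: Job) -> str:
    """Combine the size/precision, context, concurrency, and SLA rules above."""
    # Hard GPU cases: big or full-precision models, very long contexts, steady load.
    if job.params_b >= 33 or not job.quantized:
        return "GPU"
    if job.ctx_len > 32_000 or job.qps >= 5:
        return "GPU"
    # Tight per-step SLA with a larger model also pushes to GPU.
    # (The first-token <=75 ms rule would need another SLA field; omitted here.)
    if job.p95_sla_ms <= 150 and job.params_b > 13:
        return "GPU"
    # Small/quantized models: agents stay on CPU; chat prefers GPU above ~7B
    # or when context/concurrency grows.
    if job.kind == "agent":
        return "CPU"
    if job.params_b <= 7 and job.ctx_len <= 16_000 and job.qps < 2:
        return "CPU"
    return "GPU"
```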
Optimization knobs per target
For CPU (agent-heavy)
Quantize to INT8 or 4-bit; enable AMX/VNNI/SME fast paths
Use speculative decoding (a 1–3B draft model on CPU), verifying on CPU or GPU as needed; a sketch follows this list
KV-cache paging to system RAM; smaller heads or grouped-query attention (GQA) where the model supports it
Fuse pre/post steps (tokenization, retrieval, tool adapters) on the same CPU host to avoid PCIe hops
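A toy sketch of the speculative-decoding loop from the bullet above, assuming `draft_next` and `target_next` are callables that return the next token ID under the draft and target models (placeholder names); real runtimes verify the k drafted tokens in one batched target pass rather than one call per position.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=64):
    """Greedy speculative decoding: the draft proposes k tokens cheaply,
    the target keeps the longest agreeing prefix plus one corrected token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Target verifies left to right.
        accepted = k
        for i in range(k):
            t = target_next(out + draft[:i])
            if t != draft[i]:
                accepted = i
                out.extend(draft[:i])
                out.append(t)      # the target's correction is still a "free" token
                break
        if accepted == k:
            out.extend(draft)      # every drafted token was accepted
    return out
```

On a CPU agent node both models can live in the same process; when a GPU is attached, only the verify calls need to move there.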
For GPU (chat/research)
Enable continuous batching & paged attention
Use FP8/BF16 where quality allows; enable tensor-parallel for ≥70B
Pin RAG pipelines close to GPU (GPU-resident embeddings, vector search cache, or at least NVMe cache)
Warm pools to hit <50 ms first-token latency (an example knob set follows)
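A hypothetical knob set for a GPU chat/research pool, written as a plain config dict; the keys map onto the bullets above rather than any specific serving framework.

```python
# Illustrative GPU pool configuration for chat / deep-research traffic.
GPU_CHAT_POOL = {
    "dtype": "bf16",               # drop to "fp8" where quality allows
    "continuous_batching": True,   # keep the decode batch full as streams come and go
    "paged_attention": True,       # page the KV cache to avoid fragmentation at long contexts
    "tensor_parallel": 4,          # split >=70B models across GPUs
    "warm_replicas": 2,            # weights stay resident to hit <50 ms first token
    "rag": {
        "embeddings": "gpu",       # keep embedding / rerank close to the GPU
        "vector_cache": "nvme",    # or GPU-resident if memory allows
    },
}
```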
Suggested Cortensor policies (ready-to-implement)
Policy: Workload-aware placement
```
if job.type == "agent":
    if model.params <= 13B and quantized: target = CPU(AMX|AVX512|SME2)
    else:                                 target = GPU
else if job.type in {"chat", "deep_research"}:
    if model.params <= 7B and quantized and ctx_len <= 16k and qps < 2: target = CPU
    else:                                                               target = GPU
```
Policy: ISA-aware CPU dispatch
```
CPU_AMX    -> prefer INT8/BF16 kernels (oneDNN/OpenVINO)
CPU_AVX512 -> prefer 4-bit/INT8 ggml/llama.cpp fast paths
CPU_SME2   -> ONNX Runtime + KleidiAI when available (Android/Arm nodes)
```
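One way to back this dispatch table is to read CPU feature flags when a node registers. A Linux-only sketch, assuming the usual `/proc/cpuinfo` flag names (`amx_tile`, `avx512f`, `sme2`), which are kernel-version dependent; `CPU_GENERIC` is an added fallback class.

```python
def detect_cpu_class(cpuinfo_path="/proc/cpuinfo"):
    """Map reported CPU feature flags to the dispatch classes above (Linux only)."""
    try:
        with open(cpuinfo_path) as f:
            text = f.read().lower()
    except OSError:
        return "CPU_GENERIC"
    # Collect the whitespace-separated tokens from the "flags" / "Features" lines.
    feats = set()
    for line in text.splitlines():
        if line.startswith(("flags", "features")):
            feats.update(line.split(":", 1)[-1].split())
    if "amx_tile" in feats:      # Intel AMX tiles (Sapphire Rapids and later)
        return "CPU_AMX"
    if "sme2" in feats:          # Arm SME2 (reported on the Arm "Features" line)
        return "CPU_SME2"
    if "avx512f" in feats:       # AVX-512 foundation (recent Xeon / EPYC)
        return "CPU_AVX512"
    return "CPU_GENERIC"
```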
Policy: Dynamic failover
If GPU queue depth exceeds its threshold or the batcher is starved, reassign small agent steps to CPU to preserve end-to-end task time.
If CPU p95 latency exceeds the SLA for two consecutive windows, promote the job class to GPU until the backlog clears (both rules sketched below).
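A sketch of both rules, assuming the scheduler supplies the GPU queue depth per decision and one CPU p95 sample per window; the names and thresholds (`gpu_queue_limit`, `patience`) are placeholders.

```python
class FailoverPolicy:
    """Spill small agent steps to CPU when the GPU side is congested, and
    promote a CPU job class to GPU after consecutive windows over SLA."""
    def __init__(self, sla_ms, gpu_queue_limit=32, patience=2):
        self.sla_ms, self.gpu_queue_limit, self.patience = sla_ms, gpu_queue_limit, patience
        self.bad_windows, self.promoted = 0, False

    def spill_agent_step_to_cpu(self, gpu_queue_depth, batcher_starved):
        # Rule 1: preserve end-to-end task time when the GPU queue is deep or starved.
        return gpu_queue_depth > self.gpu_queue_limit or batcher_starved

    def on_window(self, cpu_p95_ms, backlog):
        # Rule 2: promote after `patience` consecutive windows over SLA,
        # demote once the backlog clears (the hysteresis avoids thrash).
        if not self.promoted:
            self.bad_windows = self.bad_windows + 1 if cpu_p95_ms > self.sla_ms else 0
            self.promoted = self.bad_windows >= self.patience
        elif backlog == 0:
            self.promoted, self.bad_windows = False, 0
        return "GPU" if self.promoted else "CPU"
```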
Policy: Quantization tiers
Agent tier: 4-bit for tool calls & planners; 8-bit verifier/critic
Chat tier: 8-bit or BF16 for final generation on GPU; keep the reranker/embeddings on CPU if helpful (tier lookup sketched below)
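These tiers can be a simple lookup the router consults per component; the role names (`planner`, `verifier`, `generator`, `reranker`) are illustrative.

```python
# Illustrative quantization-tier lookup: (job tier, component role) -> precision.
QUANT_TIERS = {
    ("agent", "planner"):   "int4",   # tool calls and planning steps
    ("agent", "tool_call"): "int4",
    ("agent", "verifier"):  "int8",   # critic / verifier gets more headroom
    ("chat",  "generator"): "bf16",   # or "int8" on GPU for final generation
    ("chat",  "reranker"):  "int8",   # reranker / embeddings can stay on CPU
}

def precision_for(tier: str, role: str) -> str:
    return QUANT_TIERS.get((tier, role), "int8")
```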
Example mappings
Agentic web-tool bot (7B, 8k ctx, bursts) → CPU-AMX/AVX512, INT8, speculative decoding on 2–4 threads; promote rare long answers to GPU.
Analyst chat (13B, 32k ctx, steady traffic) → GPU BF16/FP8 with continuous batching; CPU handles retrieval & post-proc.
On-device assistant (3–7B, mobile/edge) → Arm-SME2 (as available) with 4-bit; offload long tasks to cloud GPU.
How to measure (routing signals)
Tokens/s, first-token latency (ms), KV-cache MB per token, context length, QPS, and a burstiness factor (p95 vs p50 interarrival time; computed in the sketch below)
Apply promote/demote rules on rolling windows (e.g., 60–120 s) with hysteresis to avoid thrashing
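A sketch of the burstiness signal, assuming the router keeps recent request arrival timestamps (in seconds) per model instance; the p95/p50 interarrival ratio is one way to quantify the burstiness factor above.

```python
from statistics import quantiles

def burstiness_factor(arrival_times_s):
    """p95/p50 of request interarrival times: ~1 for steady streams, large for bursty agents."""
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    if len(gaps) < 2:
        return 1.0
    cuts = quantiles(gaps, n=100)     # cuts[49] ~ p50, cuts[94] ~ p95
    p50, p95 = cuts[49], cuts[94]
    return p95 / p50 if p50 > 0 else float("inf")
```

A ratio near 1 indicates steady streams that batch well on GPU; a large ratio flags bursty agent traffic that CPU nodes absorb more cheaply.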
Bottom line
Agent workloads: favor CPUs (AMX/AVX-512/SME2) for small/quantized models and spiky demand; they minimize idle cost and keep “time-to-task” low.
Chat/deep-research: favor GPUs for long contexts and steady decoding; batching + HBM dominate cost/perf.
A heterogeneous policy that auto-routes by model size, precision, context, QPS, and SLA will beat either CPU-only or GPU-only strategies on both cost and user experience.