Agent vs Chat/Deep-Research Inference

Workload shapes (why they differ)

  • Agent inference (tool-use, function calls, short bursts)

    • Spiky, I/O-bound, lots of small decode segments

    • Frequent context updates (tool outputs), smaller active batch

    • Latency tolerance varies by step; overall “time-to-task” matters more than raw tokens/sec

  • Chat / deep-research inference (long context, long answers)

    • Sustained decoding, large prompts, bigger KV caches

    • Amenable to batching/throughput optimization

    • Sensitive to first-token latency and steady tokens/sec

CPU vs GPU fit by workload

| Dimension | Agent (tools, routing, short bursts) | Chat / Deep-research (long ctx, long outputs) |
| --- | --- | --- |
| Best silicon | CPU-first (AMX/AVX-512/SME) when models are ≤13B and quantized (INT8/4-bit); GPUs for larger models or vision/multimodal steps | GPU-first, especially ≥13B, long contexts, or high concurrency; CPU viable for ≤7–13B quantized at low concurrency |
| Batching gains | Low; steps are irregular, so GPUs underutilize unless you coalesce across users | High; steady streams boost GPU utilization massively |
| Perf/Watt | CPUs competitive for intermittent bursts (lower idle cost); SME2 compelling on-device | GPUs win for sustained decoding and high batch sizes |
| Latency to first token | CPU can be good enough if the model is small and warm; otherwise GPU leads | GPU typically leads (kernel fusion, HBM) |
| Memory pressure | Lower (short prompts, short outputs per step) | High (long prompts, KV-cache growth); favors GPUs with HBM, or CPUs with large RAM but lower bandwidth |
| Ops complexity | Simple to scale horizontally with commodity CPUs | GPU scheduling, batching, and paged attention are more involved but pay off at scale |

Practical routing rules (drop into Router/NodePool)

By model size & precision

  • ≤7B, INT8/4-bit → CPU preferred (Xeon-AMX / EPYC-AVX512 / Arm-SME2 when available).

  • 13B, INT8/4-bit → CPU for agent; GPU for chat (switch to GPU if the prompt exceeds 16–32k tokens or the latency SLA is strict).

  • ≥33B or FP16/BF16 → GPU (both agent and chat), unless agent steps are rare and latency budget is loose.

By prompt/context

  • Context ≤16k tokens → CPUs remain viable for agent; GPUs for long replies.

  • Context >32k tokens or heavy RAG stitching → GPU (paged attention, KV offload efficiency).

By concurrency

  • QPS < 2 per model instance (bursty agents) → CPU wins on TCO.

  • QPS ≥ 5 with steady streams (chat, research) → GPU for utilization and joules/token.

By SLA

  • p95 step latency ≤150 ms (agent tool loop) → small CPU models or GPU if model >13B / multimodal.

  • First-token ≤75 ms and sustained ≥150 tokens/s per thread → GPU.
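
The rules above collapse into a single placement function. A minimal Python sketch with illustrative thresholds; ModelSpec, JobSpec, and their fields are hypothetical names, not an existing Router/NodePool schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelSpec:
    params_b: float            # parameter count in billions
    quantized: bool            # INT8 / 4-bit weights
    multimodal: bool = False   # vision / multimodal steps

@dataclass
class JobSpec:
    kind: str                          # "agent" | "chat" | "deep_research"
    ctx_len: int                       # prompt + expected output, in tokens
    qps: float                         # steady-state queries/sec per model instance
    first_token_sla_ms: Optional[int] = None

def route(job: JobSpec, model: ModelSpec) -> str:
    """Combine the size/precision, context, concurrency, and SLA rules above."""
    if model.params_b >= 33 or not model.quantized or model.multimodal:
        return "GPU"                               # big, full-precision, or multimodal
    if job.ctx_len > 32_000:
        return "GPU"                               # long context / heavy RAG stitching
    if job.first_token_sla_ms is not None and job.first_token_sla_ms <= 75:
        return "GPU"                               # tight first-token SLA
    if job.kind == "agent":
        return "CPU" if model.params_b <= 13 else "GPU"
    # chat / deep_research: CPU only for small models, short context, low concurrency
    if model.params_b <= 7 and job.ctx_len <= 16_000 and job.qps < 2:
        return "CPU"
    return "GPU"
```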

Optimization knobs per target

For CPU (agent-heavy)

  • Quantize to INT8 or 4-bit; enable AMX/VNNI/SME fast paths

  • Use speculative decoding (a 1–3B draft model on CPU), then verify on CPU or GPU as needed (see the sketch after this list)

  • KV-cache paging to system RAM; fewer attention heads / grouped-query attention (GQA) where available

  • Fuse pre/post steps (tokenization, retrieval, tool adapters) on the same CPU host to avoid PCIe hops
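
A minimal sketch of the speculative-decoding knob from the list above (greedy variant): the draft model proposes k tokens and the larger model keeps the longest agreeing prefix. The draft_next/target_next callables are placeholders for whatever runtime the node uses:

```python
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],    # small 1-3B draft model: greedy next token
    target_next: Callable[[List[int]], int],   # full model: greedy next token (verifier)
    k: int = 4,                                # tokens drafted per step
) -> List[int]:
    """One draft-then-verify step: accept the longest prefix the target agrees with,
    plus the target's own correction at the first mismatch."""
    # 1) Draft k tokens with the small model
    drafted: List[int] = []
    ctx = list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2) Verify with the target model (shown sequentially for clarity; a real
    #    implementation scores all k drafted positions in one batched forward pass)
    accepted: List[int] = []
    ctx = list(prompt)
    for t in drafted:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)   # target's correction replaces the rejected draft token
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3) All drafted tokens accepted: take one bonus token from the target
    accepted.append(target_next(ctx))
    return accepted
```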

For GPU (chat/research)

  • Enable continuous batching & paged attention (see the engine-config sketch after this list)

  • Use FP8/BF16 where quality allows; enable tensor-parallel for ≥70B

  • Pin RAG pipelines close to GPU (GPU-resident embeddings, vector search cache, or at least NVMe cache)

  • Warm pools to hit <50 ms first-token
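
Continuous batching and paged attention come built into serving engines such as vLLM (one option, not a Cortensor requirement). A minimal sketch using vLLM's offline LLM API; the checkpoint name, parallelism degree, and sizes are placeholders:

```python
# Sketch only: assumes vLLM is installed and the node has enough GPUs for TP=4.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    dtype="bfloat16",             # or FP8 where quality allows and hardware supports it
    tensor_parallel_size=4,       # tensor parallelism for >=70B models
    max_model_len=32768,          # long-context chat / deep-research prompts
    gpu_memory_utilization=0.90,  # leave headroom for the paged KV cache
)

# Continuous batching and paged attention are handled inside the engine;
# callers just submit requests and the scheduler coalesces them.
out = llm.generate(["Summarize the findings..."], SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)
```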

Suggested Cortensor policies (ready-to-implement)

  1. Policy: Workload-aware placement

# params_b is the parameter count in billions; ctx_len is in tokens
if job.type == "agent":
    if model.params_b <= 13 and model.quantized:
        target = "CPU"   # AMX | AVX-512 | SME2
    else:
        target = "GPU"
elif job.type in {"chat", "deep_research"}:
    if model.params_b <= 7 and model.quantized and job.ctx_len <= 16_000 and job.qps < 2:
        target = "CPU"
    else:
        target = "GPU"
  2. Policy: ISA-aware CPU dispatch

CPU_AMX    -> prefer INT8/BF16 kernels (oneDNN/OpenVINO)
CPU_AVX512 -> prefer 4-bit/INT8 ggml/llama.cpp fast paths
CPU_SME2   -> ONNX Runtime + KleidiAI when available (Android/Arm nodes)
  3. Policy: Dynamic failover

  • If GPU queue depth > threshold or batcher starved, reassign small agent steps to CPU to preserve end-to-end task time.

  • If CPU p95 > SLA for two windows, promote job class to GPU until backlog clears.
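
A minimal sketch of the failover rule above; the queue-depth limit, window count, and field names are illustrative:

```python
def failover_target(default: str, gpu_queue_depth: int, cpu_p95_ms: float,
                    sla_ms: float, windows_over_sla: int,
                    gpu_queue_limit: int = 32) -> str:
    """Reassign work when either pool degrades (thresholds are illustrative)."""
    # GPU backlog or starved batcher: push small agent steps back to CPU
    # to preserve end-to-end task time.
    if default == "GPU" and gpu_queue_depth > gpu_queue_limit:
        return "CPU"
    # Sustained CPU SLA misses: promote the job class to GPU until the backlog clears.
    if default == "CPU" and cpu_p95_ms > sla_ms and windows_over_sla >= 2:
        return "GPU"
    return default
```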

  4. Policy: Quantization tiers

  • Agent tier: 4-bit for tool calls & planners; 8-bit verifier/critic

  • Chat tier: 8-bit or BF16 for final generation on GPU; keep reranker/embedding on CPU if helpful
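
The same tiers expressed as a lookup the router could consult; the role names are illustrative, not an existing Cortensor schema:

```python
QUANT_TIERS = {
    "agent": {                    # tool calls & planning steps
        "planner":   "int4",
        "tool_call": "int4",
        "verifier":  "int8",      # critic/verifier gets the higher-precision pass
    },
    "chat": {                     # long-form generation
        "generator": "bf16",      # or int8 on GPU, depending on quality budget
        "reranker":  "int8",      # can stay on CPU
        "embedding": "int8",      # can stay on CPU
    },
}
```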

Example mappings

  • Agentic web-tool bot (7B, 8k ctx, bursts) → CPU-AMX/AVX512, INT8, speculative decoding on 2–4 threads; promote rare long answers to GPU.

  • Analyst chat (13B, 32k ctx, steady traffic) → GPU BF16/FP8 with continuous batching; CPU handles retrieval & post-proc.

  • On-device assistant (3–7B, mobile/edge) → Arm-SME2 (as available) with 4-bit; offload long tasks to cloud GPU.

How to measure (routing signals)

  • Tokens/s, first-token latency (ms), KV-cache MB/token, context length, QPS, burstiness factor (ratio of p95 to p50 interarrival time)

  • Promote/demote rules on rolling windows (e.g., 60–120s) with hysteresis to avoid thrash
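
A minimal sketch of the promote/demote logic with hysteresis, assuming per-window p95 latency and interarrival percentiles are already aggregated; the window counts and thresholds are illustrative:

```python
def burstiness(p50_interarrival_s: float, p95_interarrival_s: float) -> float:
    """Burstiness factor from the signal list above: p95 vs p50 interarrival time."""
    return p95_interarrival_s / max(p50_interarrival_s, 1e-9)

class HysteresisRouter:
    """Promote a job class to GPU after `promote_after` consecutive windows over the
    p95 SLA; demote back to CPU only after `demote_after` consecutive healthy windows,
    so a single noisy 60-120 s window cannot flip placement back and forth."""

    def __init__(self, sla_p95_ms: float, promote_after: int = 2, demote_after: int = 4):
        self.sla_p95_ms = sla_p95_ms
        self.promote_after = promote_after
        self.demote_after = demote_after
        self.target = "CPU"
        self.bad = 0   # consecutive windows over SLA
        self.good = 0  # consecutive windows within SLA

    def observe_window(self, p95_ms: float) -> str:
        if p95_ms > self.sla_p95_ms:
            self.bad, self.good = self.bad + 1, 0
        else:
            self.bad, self.good = 0, self.good + 1
        if self.target == "CPU" and self.bad >= self.promote_after:
            self.target = "GPU"
        elif self.target == "GPU" and self.good >= self.demote_after:
            self.target = "CPU"
        return self.target
```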


Bottom line

  • Agent workloads: favor CPUs (AMX/AVX-512/SME2) for small/quantized models and spiky demand; they minimize idle cost and keep “time-to-task” low.

  • Chat/deep-research: favor GPUs for long contexts and steady decoding; batching + HBM dominate cost/perf.

  • A heterogeneous policy that auto-routes by model size, precision, context, QPS, and SLA will beat either CPU-only or GPU-only strategies on both cost and user experience.
