Learning Objectives
- Explain when agentic workflows add value in finance and when conventional statistical or rules-based pipelines remain the better choice
- Distinguish the roles of ReAct, Tree of Thoughts, and Reflexion, and choose appropriate reasoning budgets and compositions for evidence-driven financial tasks
- Design explicit agent state and memory schemas that support provenance, checkpointing, replay, schema evolution, and post-outcome evaluation
- Specify robust tool contracts, structured outputs, source policies, and context-engineering rules for read-only research and forecasting agents
- Compare framework styles and define a migration path from notebook prototypes to operational forecasting services without sacrificing visibility and control
- Build a single-agent evidence-first research workflow with quality gates, abstention behavior, and replayable artifacts
- Design and evaluate multi-agent forecasting pipelines using specialist diversity, aggregation, calibration, baselines, and ablation analysis
- Define the operational, statistical, and security controls required to make financial-agent outputs decision-grade, including point-in-time integrity, contamination-aware testing, observability, policy gates, and human approval boundaries
From Prediction Functions to Agentic Workflows
This section establishes that agentic systems extend traditional ML pipelines upstream, rather than replacing statistical modeling, into messier territory where evidence is noisy, heterogeneous, and incomplete. The chapter's scope is deliberately constrained to read-only information actions (retrieving data, querying filings, requesting calculations) rather than order execution, with outputs at the L1 decision-support level producing structured forecasts for human review. A running example of estimating the probability that a specific earnings event resolves positively ties together the entire chapter, with each section mapping to one stage of a six-phase pipeline from evidence gathering through calibration and scoring.
1 notebook
Cognitive Architectures: How Agents Reason
The section introduces three reasoning frameworks as building blocks for agent design: ReAct for auditable evidence-grounded loops, Tree of Thoughts for parallel hypothesis exploration at decision branch points, and Reflexion for post-run critique that conditions future behavior through stored lessons. It argues for a layered adoption sequence starting with ReAct as the baseline, adding ToT only for branch-heavy decisions, and layering Reflexion only after the evaluation pipeline is stable enough to identify which lessons are worth persisting. The section also surveys each pattern's practical failure modes and recommends hard per-run reasoning budgets to keep inference costs bounded.
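The budgeted ReAct pattern described above can be sketched in a few lines. This is a minimal illustration, not the chapter's actual interface: `toy_think` and `toy_act` are stand-ins for an LLM policy and a data tool, and the budget forces abstention instead of unbounded looping.

```python
from dataclasses import dataclass, field

MAX_STEPS = 6  # hard per-run reasoning budget

@dataclass
class Trace:
    steps: list = field(default_factory=list)

def react_loop(question, think, act, budget=MAX_STEPS):
    """Alternate thought -> action -> observation until the policy emits
    a final answer, or the budget runs out (then abstain with None)."""
    trace = Trace()
    observation = None
    for _ in range(budget):
        thought, action, arg = think(question, observation, trace.steps)
        trace.steps.append(("thought", thought))
        if action == "finish":
            trace.steps.append(("answer", arg))
            return arg, trace
        observation = act(action, arg)
        trace.steps.append(("observe", action, observation))
    return None, trace  # budget exhausted: abstain rather than guess

# Toy policy and tool standing in for an LLM and a data API.
def toy_think(question, observation, history):
    if observation is None:
        return ("need one evidence item", "lookup", "guidance revision")
    return ("evidence retrieved, conclude", "finish", "positive")

def toy_act(action, arg):
    return {"source": "stub", "query": arg}

answer, trace = react_loop("Does guidance support a positive resolution?",
                           toy_think, toy_act)
```

Because every thought, action, and observation lands in the trace, the run is auditable after the fact, which is the property the chapter leans on for evidence-grounded loops.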
Agent Memory: State, Persistence, and Replay
This section argues that explicit memory design is part of model risk control, not just software architecture, establishing a three-tier hierarchy of working memory (current reasoning step), short-term session memory (attempted actions and partial results), and long-term persistent memory (run artifacts, scored outcomes, calibration history). Typed state objects with fields like cutoff_date, evidence records, and tool-call traces replace implicit chat history to enable deterministic validation, checkpoint-based replay, and systematic debugging of non-deterministic systems. Memory lifecycle governance including retention windows, eviction rules, conflict-handling policies, and schema versioning prevents stale assumptions from accumulating as persistent analytical bias.
1 notebook
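A typed state object of the kind this section advocates might look like the following sketch. The field names (`cutoff_date`, `evidence`, `tool_calls`) come from the summary above; the validation rule and schema-version constant are illustrative assumptions about how such a schema could be enforced.

```python
from dataclasses import dataclass, field
from datetime import date

SCHEMA_VERSION = "1.0"  # bump on any field change; old runs replay on their own version

@dataclass(frozen=True)
class EvidenceRecord:
    claim: str
    source_url: str
    retrieved_at: date  # must not postdate the run's cutoff_date

@dataclass
class AgentState:
    run_id: str
    cutoff_date: date                       # point-in-time boundary for all evidence
    schema_version: str = SCHEMA_VERSION
    evidence: list = field(default_factory=list)    # EvidenceRecord items
    tool_calls: list = field(default_factory=list)  # (tool, args, result) traces

    def add_evidence(self, rec: EvidenceRecord) -> None:
        """Deterministic validation at write time, not at synthesis time."""
        if rec.retrieved_at > self.cutoff_date:
            raise ValueError("evidence postdates cutoff_date")
        self.evidence.append(rec)

state = AgentState(run_id="run-001", cutoff_date=date(2024, 3, 31))
state.add_evidence(EvidenceRecord("data-center revenue grew q/q",
                                  "https://example.com/filing",
                                  date(2024, 2, 21)))
```

Rejecting out-of-window evidence at the moment it is written is what turns the cutoff date from a convention into a checkable invariant.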
Tool Integration: Contracts, Controls, and Context Engineering
The section presents tool design as often the dominant determinant of agent quality in finance, outweighing prompt engineering and model selection. Strong tool contracts specify purpose, typed arguments, error semantics, and provenance fields, while context engineering controls what each reasoning step can see by exposing only task-relevant state fields, phase-appropriate tools, and point-in-time-consistent evidence sources. Source policies with domain allowlists, date constraints tied to cutoff dates, and mandatory provenance metadata on every returned item enforce the discipline that makes outputs auditable and reproducible.
1 notebook
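One way to make the source-policy discipline concrete is a per-item admissibility check. The sketch below is an assumption about how such a policy could be coded, with a hypothetical allowlist; the three conditions (allowlisted domain, publication date at or before the cutoff, provenance present) are exactly the ones named above.

```python
from datetime import date
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"sec.gov"}  # hypothetical domain allowlist

def passes_source_policy(item: dict, cutoff: date) -> bool:
    """An item is admissible only if it comes from an allowlisted domain,
    was published on or before the run's cutoff, and carries provenance."""
    domain = urlparse(item.get("url", "")).netloc
    on_allowlist = any(domain == d or domain.endswith("." + d)
                       for d in ALLOWED_DOMAINS)
    in_window = item.get("published") is not None and item["published"] <= cutoff
    has_provenance = bool(item.get("url")) and bool(item.get("retrieved_at"))
    return on_allowlist and in_window and has_provenance

good = {"url": "https://www.sec.gov/filing", "published": date(2024, 2, 1),
        "retrieved_at": date(2024, 3, 1)}
off_list = {"url": "https://blog.example.com/rumor", "published": date(2024, 2, 1),
            "retrieved_at": date(2024, 3, 1)}
```

Running the check at the tool boundary, before items ever enter agent state, is what makes downstream outputs auditable: nothing unvetted can reach the synthesis step.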
The Engineering Stack: Frameworks and Migration
This section argues that framework selection should follow from explicit constraints like state visibility, replay capability, and policy enforcement rather than benchmark accuracy claims, since the binding constraints for financial agent workflows are governance properties, not task scores. It proposes a migration sequence that starts with native SDK calls and typed schemas, adds explicit state objects and trace capture, then checkpoints and replay hooks, and finally multi-agent orchestration only when evaluation demonstrates measurable improvement. The comparison between native SDK, CrewAI-style roles, and LangGraph-style state graphs clarifies migration trade-offs while pointing toward the production forecaster as the culmination of notebook patterns with stronger operational constraints.
1 notebook
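The "checkpoints and replay hooks" step of the migration sequence can be as simple as atomic serialization of the typed state after each phase. This is a minimal sketch under the assumption that state is JSON-serializable; the function names and layout are illustrative, not a framework API.

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write state atomically so a crashed run can resume from the last
    completed phase instead of restarting from scratch."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f, sort_keys=True)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "run-001.json")
save_checkpoint({"phase": "evidence", "cutoff_date": "2024-03-31"}, path)
restored = load_checkpoint(path)
```

Sorting keys makes checkpoints byte-stable across runs, which is useful when diffing two replays of the same run.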
Core Project: Stateful Equity Research Agent
This section operationalizes the preceding design patterns in a single-agent equity research workflow for NVIDIA, where every claim must trace to a specific tool call and the system must persist enough artifacts for replay. Three quality gates (coverage, freshness, consistency) must pass before synthesis, and if any gate fails the agent abstains with a bounded output explaining which conditions were not met. The section establishes acceptance criteria for promoting to multi-agent workflows: stable task success across repeated runs, citation faithfulness above threshold, low policy-violation rates, reproducible replay, and explicit abstention when evidence is insufficient.
1 notebook
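The three quality gates and the bounded abstention output can be sketched as a single pre-synthesis check. The thresholds and field names here are illustrative assumptions; the structure (every failed gate named in the abstention) follows the behavior described above.

```python
from datetime import date

def run_quality_gates(evidence, cutoff, min_sources=3, max_age_days=90):
    """Return ("proceed", []) or ("abstain", reasons); reasons name the
    exact gate conditions that failed, forming the bounded abstention."""
    reasons = []
    if len({e["url"] for e in evidence}) < min_sources:           # coverage
        reasons.append(f"coverage: fewer than {min_sources} distinct sources")
    stale = [e for e in evidence
             if (cutoff - e["published"]).days > max_age_days]     # freshness
    if stale:
        reasons.append(f"freshness: {len(stale)} items older than {max_age_days} days")
    directions = {e["direction"] for e in evidence}                # consistency
    if {"up", "down"} <= directions:
        reasons.append("consistency: contradictory claim directions unresolved")
    return ("abstain", reasons) if reasons else ("proceed", [])

cutoff = date(2024, 3, 31)
ok = [{"url": f"https://example.com/{i}", "published": date(2024, 3, 1),
       "direction": "up"} for i in range(3)]
verdict, reasons = run_quality_gates(ok, cutoff)
thin_verdict, thin_reasons = run_quality_gates(ok[:1], cutoff)
```

Because abstention returns the failing conditions rather than a forecast, a reviewer can see at a glance whether to gather more evidence or accept the abstention.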
Multi-Agent Forecasting Systems
The section extends the single-agent baseline to a multi-agent forecasting architecture with six layers: intake, parallel research agents, aggregation with optional extremization, adversarial debate, policy-bound supervisor reconciliation, and probability calibration. Neyman extremization amplifies deviations from base rates in proportion to genuine forecaster diversity, while Platt scaling addresses the documented miscalibration of LLM probability estimates. The evaluation methodology requires proper scoring rules (Brier score, log score, ECE), ablation analysis testing whether each component adds measurable value, and baseline comparison against market consensus, with components that fail the retention rubric demoted or removed.
5 notebooks
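Two of the building blocks above are small enough to sketch directly: a standard power-transform extremizer and the Brier score. The exponent `d` is a free parameter that, per the section, should scale with genuine forecaster diversity; Platt scaling is omitted here because it requires fitting on labeled outcomes.

```python
def extremize(p: float, d: float = 1.5) -> float:
    """Push an aggregate probability away from 0.5 via a power transform
    on the odds; d = 1 is the identity, d > 1 extremizes."""
    return p ** d / (p ** d + (1.0 - p) ** d)

def brier(probs, outcomes) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    A proper scoring rule: lower is better, 0.0 is perfect."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

panel = [0.62, 0.70, 0.66]                  # three specialist forecasts
pooled = sum(panel) / len(panel)            # simple mean aggregation
sharpened = extremize(pooled, d=2.0)        # optional extremization step
```

Scoring both `pooled` and `sharpened` with `brier` against resolved outcomes is exactly the ablation test the section calls for: keep extremization only if it measurably lowers the score.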
Production: Reliability, Replay, and Contamination Control
This section addresses the gap between notebook prototypes and decision-grade production systems, covering non-determinism management through bounded retries and checkpoint recovery, observability through complete operational traces separating engineering and research views, and contamination-resistant evaluation using strict temporal splits, time-shift tests, and event windows. It presents a three-dimensional evaluation stack spanning research agent quality (task success, citation faithfulness), forecasting quality (Brier score, calibration, sharpness), and operational health (incident rate, policy violations, drift alerts). Cost optimization through model cascading, caching, and per-forecast budgets is treated as valid only when it does not degrade calibration or increase policy violations.
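The strict temporal split described above can be sketched as follows; the event dictionary fields are assumptions about how resolution and evidence-cutoff dates might be recorded. The key property is that events straddling the boundary fall into neither set, since they are contamination risks.

```python
from datetime import date

def temporal_split(events, train_end):
    """Point-in-time split: training events must be fully resolved by
    train_end; evaluation events must open their evidence window after it.
    Events straddling the boundary are deliberately dropped."""
    train = [e for e in events if e["resolved"] <= train_end]
    test = [e for e in events if e["evidence_cutoff"] > train_end]
    return train, test

events = [
    {"id": 1, "evidence_cutoff": date(2024, 1, 10), "resolved": date(2024, 1, 25)},
    {"id": 2, "evidence_cutoff": date(2024, 2, 20), "resolved": date(2024, 3, 5)},
    {"id": 3, "evidence_cutoff": date(2024, 1, 28), "resolved": date(2024, 2, 10)},
]
train, test = temporal_split(events, date(2024, 1, 31))
```

Here event 3 is excluded from both sets: its evidence window opened before the boundary but it resolved after, so using it either way would leak information across the split.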
Security and Governance
The section formalizes a threat model spanning input, retrieval, tool, state, and output layers, with prompt injection and retrieval poisoning identified as primary risks in tool-connected systems. The Warden pattern interposes a policy proxy between agent output and runtime execution, validating operations against allowlists, enforcing argument constraints, and logging every decision to immutable storage. Security testing with adversarial scenarios (injection payloads, malformed arguments, tool shadowing, stale-source retrieval) converts security posture from narrative assurance to measurable system behavior, with metrics like prompt-injection success rate, unsupported-claim rate, and mean time to incident diagnosis integrated into the same trace infrastructure used for forecasting evaluation.
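A minimal sketch of the Warden pattern's core check, under the assumption of a static allowlist mapping tool names to permitted argument keys; a real deployment would also constrain argument values and write to genuinely immutable storage rather than an in-memory list.

```python
import time

ALLOWLIST = {                      # hypothetical read-only tool surface
    "get_filing": {"ticker", "form"},
    "get_price_history": {"ticker", "start", "end"},
}
AUDIT_LOG = []                     # stand-in for immutable append-only storage

def warden(tool: str, args: dict) -> bool:
    """Validate a proposed tool call against the allowlist and argument
    schema; log every decision, allow or deny, before returning it."""
    if tool not in ALLOWLIST:
        decision = "deny:tool-not-allowlisted"
    elif set(args) - ALLOWLIST[tool]:
        decision = "deny:unexpected-arguments"
    else:
        decision = "allow"
    AUDIT_LOG.append({"ts": time.time(), "tool": tool,
                      "args": dict(args), "decision": decision})
    return decision == "allow"

ok = warden("get_filing", {"ticker": "NVDA", "form": "10-Q"})
blocked = warden("place_order", {"ticker": "NVDA", "qty": 100})
```

Because denials are logged with the same machinery as allows, the adversarial scenarios the section lists (tool shadowing, malformed arguments) produce measurable deny rates in the shared trace infrastructure rather than silent failures.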