Learning Objectives
- Distinguish financial questions that genuinely require graph structure from those better served by tabular databases
- Design a compact, typed, and auditable financial knowledge graph with stable entity identity, a finite relationship vocabulary, and edge-level provenance
- Build and validate LLM-assisted extraction pipelines that convert disclosures into replayable graph objects while routing low-confidence extractions to human review
- Explain how Graph RAG differs from vector retrieval and implement safe relational query workflows using constrained text-to-Cypher generation
- Transform graph structure into leakage-aware machine learning features, including topology, crowding, concentration, and temporal dynamics
- Evaluate explicit knowledge graphs, statistical financial networks, and learned graph representations pragmatically, matching each method's maturity to the use case
- Apply a three-timestamp framework and disclosure-time cutoff rules to prevent temporal leakage in graph queries, features, and evaluation splits
- Make sound engineering choices about graph databases, ontology scope, query safety, and schema evolution for production financial knowledge graph workflows
When Relational Structure Unlocks Financial Insight
This section establishes a decision framework for when graph infrastructure is justified, identifying three conditions under which graph structure reliably adds value: multi-hop dependency queries (supply chain contagion analysis), structural crowding and co-ownership patterns (institutional holdings that predict excess comovement), and temporal relationship evolution (edges that change before price data reflects the shift). Equally important, it identifies three conditions where graphs do not help: single-entity attribute lookups, narrative synthesis over broad corpora, and sparse graphs with few relationships where topology metrics become unreliable. The practical test is whether the question naturally decomposes into path patterns across entities.
2 notebooks
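The "path patterns across entities" test can be made concrete with a small sketch. Assuming a toy directed supply chain (the companies and edges below are illustrative, not extracted from filings), a multi-hop contagion question reduces to a bounded breadth-first traversal from the shocked node:

```python
from collections import deque

# Toy directed supply chain: edges point from supplier to customer.
# All edges here are hypothetical examples.
supplies_to = {
    "TSMC": ["AAPL", "NVDA", "AMD"],
    "AMD": ["MSFT"],
    "NVDA": ["MSFT"],
    "AAPL": [],
    "MSFT": [],
}

def downstream_exposure(graph, shocked_supplier, max_hops=2):
    """Return {company: hop_distance} for firms reachable from a shocked node."""
    exposed, frontier = {}, deque([(shocked_supplier, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops >= max_hops:
            continue
        for customer in graph.get(node, []):
            if customer not in exposed:
                exposed[customer] = hops + 1
                frontier.append((customer, hops + 1))
    return exposed

exposure = downstream_exposure(supplies_to, "TSMC")
# Direct customers sit at distance 1; their customers at distance 2.
```

A question that fits this shape (shock at one node, effects propagated along typed edges) is a graph question; a single-entity attribute lookup never needs the traversal.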
Constructing Financial Knowledge Graphs
The section presents a governance-first approach to KG construction where identity contracts, schema contracts, and provenance contracts are defined before any extraction begins, then paired with LLM-based extraction for throughput. A five-stage workflow covers targeted document slicing, schema-constrained generation, canonicalization, rule validation, and human review queuing, with emphasis on idempotent writes and quality metrics beyond precision (schema-valid rate, provenance coverage, duplicate-node rate, temporal consistency). The supply chain schema example demonstrates extracting supplier, customer, and competitive relationships from S&P 100 10-K filings, revealing shared suppliers like TSMC and Foxconn as concentration nodes.
2 notebooks
Graph RAG: Deterministic Relational Reasoning
Graph RAG delegates relational joins to the database engine and language generation to the LLM, providing deterministic multi-hop retrieval that vector search cannot match for path-dependent questions. The five-stage architecture covers query routing, text-to-Cypher generation, safety validation, deterministic execution, and grounded synthesis with two-layer citations linking graph rows to underlying disclosure text. FinReflectKG-MultiHop benchmark evidence shows KG-guided retrieval improving correctness by approximately 24% while reducing token consumption by roughly 85% compared to page-window retrieval, demonstrating that structured retrieval is both more accurate and more economical for relational questions.
2 notebooks
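The safety-validation stage can be sketched as a gate that every LLM-generated Cypher query must pass before execution. This is a simplified illustration, assuming a hypothetical allowlist of labels and relationship types; production systems would also run on read-only credentials and enforce server-side timeouts:

```python
import re

# Illustrative allowlist of node labels and relationship types.
ALLOWED_TOKENS = {"Company", "Supplier", "SUPPLIES_TO"}
FORBIDDEN = re.compile(r"\b(CREATE|MERGE|DELETE|SET|REMOVE|DROP|CALL)\b", re.I)

def validate_cypher(query, max_limit=100):
    """Reject generated Cypher that writes, uses unknown labels, or lacks a LIMIT."""
    if FORBIDDEN.search(query):
        return False, "write or procedure clause not allowed"
    for token in re.findall(r":(\w+)", query):
        if token not in ALLOWED_TOKENS:
            return False, f"label {token!r} not in allowlist"
    m = re.search(r"\bLIMIT\s+(\d+)", query, re.I)
    if not m or int(m.group(1)) > max_limit:
        return False, "missing or oversized LIMIT"
    return True, "ok"

ok, reason = validate_cypher(
    "MATCH (s:Supplier)-[:SUPPLIES_TO]->(c:Company) RETURN s.name LIMIT 25"
)
```

Because execution happens only after this deterministic check, the LLM can propose queries freely while the database remains read-only and bounded.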
From Graphs to Machine Learning Features
This section transforms graph structure into tabular features for gradient boosting and factor models, covering network topology features (PageRank, betweenness, clustering coefficient), supply chain risk indicators (supplier count, single-source dependency, overlap ratios), and institutional crowding features (crowding score, smart money concentration, ownership HHI, co-ownership Jaccard). The distinctive value emerges from cross-graph integration: combining supply chain concentration with institutional crowding creates compound risk features that encode structural dependencies traditional factor models treat as independent. Temporal dynamics such as relationship churn, centrality momentum, and event propagation lags add lead-lag features that capture evolving structure invisible to static snapshots.
1 notebook
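Two of the crowding features above have simple closed forms and can be computed directly from 13F-style data. A minimal sketch, assuming illustrative holder lists and ownership fractions (fund names hypothetical):

```python
def ownership_hhi(shares):
    """Herfindahl-Hirschman index of institutional ownership concentration.

    `shares` are ownership fractions; they are renormalized so the index
    reflects concentration among the listed holders.
    """
    total = sum(shares)
    return sum((s / total) ** 2 for s in shares) if total else 0.0

def co_ownership_jaccard(holders_a, holders_b):
    """Jaccard overlap of the institutional holder sets of two stocks."""
    a, b = set(holders_a), set(holders_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative inputs: four holders with 40/30/20/10 splits, and two
# stocks sharing two of four distinct holders.
hhi = ownership_hhi([0.40, 0.30, 0.20, 0.10])            # 0.30
jac = co_ownership_jaccard({"FundA", "FundB", "FundC"},
                           {"FundB", "FundC", "FundD"})  # 0.5
```

Both values drop straight into a tabular feature matrix alongside topology metrics such as PageRank, which is the point of the section: graph-derived numbers, gradient-boosting-ready columns.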
Financial Networks: From Correlations to Portfolios
The section connects the established financial networks literature (correlation-based MSTs, hierarchical clustering) to knowledge graph features, showing how network-aware portfolio allocation weights peripheral stocks more heavily for diversification benefits. It provides a maturity assessment of graph neural networks across financial domains: fraud detection is production-ready with clear labels and high signal-to-noise ratios, systemic risk monitoring is emerging in regulatory pilots, and alpha generation remains experimental with limited reproducible evidence after costs. The pragmatic recommendation is to start with hand-crafted graph features as the auditable, stable baseline and add GNNs only if they improve out-of-sample metrics after transaction costs and temporal leakage controls.
1 notebook
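The correlation-MST pipeline can be sketched end to end on a toy universe. This assumes illustrative pairwise correlations for four tickers and uses the standard Mantegna distance d = sqrt(2(1 - rho)) with Kruskal's algorithm; the inverse-degree weighting at the end is one simple way to tilt toward peripheral stocks, not the only scheme in the literature:

```python
import math

# Illustrative pairwise return correlations for four tickers.
corr = {("A", "B"): 0.8, ("A", "C"): 0.3, ("A", "D"): 0.2,
        ("B", "C"): 0.4, ("B", "D"): 0.1, ("C", "D"): 0.6}

def mst_edges(corr):
    """Kruskal's MST over the distance d = sqrt(2 * (1 - rho))."""
    edges = sorted(corr.items(), key=lambda kv: math.sqrt(2 * (1 - kv[1])))
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    tree = []
    for (u, v), _rho in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

tree = mst_edges(corr)            # 3 edges spanning 4 nodes
degree = {}
for u, v in tree:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Peripheral (low-degree) stocks receive larger weights for diversification.
inv = {n: 1 / d for n, d in degree.items()}
weights = {n: w / sum(inv.values()) for n, w in inv.items()}
```

On this toy input the tree keeps the strongest-correlation links, and the leaf nodes A and D end up with twice the weight of the hub nodes B and C.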
Temporal Integrity and Leakage-Safe Evaluation
This section introduces the three-timestamp model (event time, disclosure time, extraction time) as central to financial knowledge graphs, arguing that feature generation and historical queries must use disclosure time as the visibility gate since economically true but undisclosed information cannot enter models. Using 8-K filings and 13F holdings as case studies, it demonstrates how disclosure lags, position masking, and confidential treatment gaps create temporal inconsistencies that must be handled explicitly. The leakage-safe evaluation protocol requires splitting by disclosure time, embargoing observations near split boundaries, enforcing cutoff filtering in every query path, and logging snapshot hashes for replay and audit.
2 notebooks
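The three-timestamp model and the disclosure-time gate can be shown in a few lines. The dates and edges below are illustrative; the point is that the filter keys on `disclosed`, never on `event`, so an economically true but not-yet-disclosed relationship stays invisible to any feature computed at the cutoff:

```python
from datetime import date

# Each edge carries three timestamps: when it became economically true,
# when it was disclosed, and when the pipeline extracted it.
edges = [
    {"rel": "SUPPLIES_TO", "head": "TSMC", "tail": "AAPL",
     "event": date(2023, 1, 10), "disclosed": date(2023, 2, 3),
     "extracted": date(2023, 2, 4)},
    {"rel": "SUPPLIES_TO", "head": "Foxconn", "tail": "AAPL",
     "event": date(2023, 3, 1), "disclosed": date(2023, 11, 3),
     "extracted": date(2023, 11, 5)},
]

def as_of(edges, cutoff):
    """Disclosure time gates visibility: edges disclosed after `cutoff`
    must not enter features or historical queries dated `cutoff`."""
    return [e for e in edges if e["disclosed"] <= cutoff]

visible = as_of(edges, date(2023, 6, 30))  # only the February disclosure
```

Enforcing this filter in every query path, plus embargoes around split boundaries and logged snapshot hashes, is what makes the evaluation protocol replayable.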
Building a KG-Ready Pipeline: Engineering Decisions
The section addresses the infrastructure choices that make knowledge graph workflows reliable: database engine selection (Neo4j as the teaching default, with PostgreSQL recursive CTEs as a simpler alternative for narrow graphs), ontology strategy (starting compact with 8-15 relationship types rather than attempting full FIBO adoption), and query safety controls including read-only credentials, allowlisted labels, parameterized queries, and execution limits. Schema versioning discipline is emphasized, with additive changes as the safe default and version metadata on edges enabling coexistence of old and new extraction formats during transitions.
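The additive-change discipline can be sketched with a version-tolerant edge reader. Field names here are hypothetical: the idea is that a v2 schema adds an optional field (an extraction-confidence score in this example) without invalidating v1 records, so both formats coexist during the transition:

```python
def read_edge(record):
    """Version-tolerant reader for stored edges.

    Additive schema evolution: v1 records remain valid, while v2 records
    carry an extra `confidence` field (illustrative) that defaults for v1.
    """
    version = record.get("schema_version", 1)
    edge = {"head": record["head"],
            "relation": record["relation"],
            "tail": record["tail"]}
    # v2 added an optional extraction-confidence score; v1 rows default to 1.0.
    edge["confidence"] = record.get("confidence", 1.0) if version >= 2 else 1.0
    return edge

old = read_edge({"head": "TSMC", "relation": "SUPPLIES_TO", "tail": "AAPL"})
new = read_edge({"schema_version": 2, "head": "TSMC",
                 "relation": "SUPPLIES_TO", "tail": "NVDA",
                 "confidence": 0.92})
```

Renames and type changes, by contrast, would break this reader for one format or the other, which is why additive changes are the safe default.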