In 2024 Jeff McMillan returned to Invisible Machines alongside David Wu, Head of Knowledge Management & Generative AI at Morgan Stanley, for a deep dive into what it actually takes to curate institutional knowledge before retrieval—not after a demo ships.
This article summarizes that conversation. It is not a transcript. For Jeff’s third visit on agent-scale foundations, see Knowledge Before Agents and the arc essay connecting all three appearances.
From Dump to Domain Owners
The starting mistake is familiar: treating retrieval-augmented generation (RAG) as a summarizer over everything you ever filed. McMillan and Wu described shrinking the corpus from roughly sixty thousand documents toward twenty thousand, not by deleting history arbitrarily, but by assigning domain owners who could defend why a piece of knowledge still belonged there. Unstructured data had to be inventoried, and every document needed a named human owner who could provide oversight when answers drifted.
The internal model they pursued looked less like a chatbot over a file share and more like an internal Wikipedia—canonical entries, explicit ownership, time-to-live on content so stale guidance does not become immortal because nobody scheduled a review.
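The ownership and time-to-live idea can be sketched as a small data model. This is a minimal illustration, not the team's actual schema: the `KnowledgeEntry` class, its field names, the `review_queue` helper, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class KnowledgeEntry:
    """One canonical entry. All field names are hypothetical."""
    title: str
    owner: str           # the accountable domain owner
    last_reviewed: date
    ttl_days: int        # days allowed between reviews

    def expires_on(self) -> date:
        return self.last_reviewed + timedelta(days=self.ttl_days)

    def is_stale(self, today: date) -> bool:
        return today > self.expires_on()


def review_queue(entries, today):
    """Group stale entry titles by owner so each owner gets a review list."""
    queue = {}
    for entry in entries:
        if entry.is_stale(today):
            queue.setdefault(entry.owner, []).append(entry.title)
    return queue


# Invented sample data, for illustration only.
entries = [
    KnowledgeEntry("Wire transfer limits", "ops-team", date(2024, 1, 10), 90),
    KnowledgeEntry("Account opening checklist", "onboarding", date(2024, 5, 1), 180),
]
stale = review_queue(entries, today=date(2024, 6, 1))
```

Because every entry carries an owner, a stale entry always maps to someone who can renew or retire it, rather than lingering in the corpus by default.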
Vector Plus Metadata Hybrid
Pure vector search was not enough. The team combined embeddings with rich metadata—product lines, jurisdictions, audience, freshness—so retrieval could be governed, not merely similarity-matched. Hybrid search let advisors get answers that respected business rules, not just nearest neighbors in embedding space.
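One common way to combine the two signals is to filter on metadata first and rank only the survivors by embedding similarity. The sketch below assumes that pattern. The conversation does not specify how the team wired it up, and the `Doc` fields, the `hybrid_search` function, and the toy three-dimensional vectors are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    embedding: list      # toy 3-d vector; real embeddings are far larger
    jurisdiction: str
    audience: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, docs, filters, top_k=3):
    """Apply metadata filters first, then rank survivors by cosine similarity."""
    allowed = [
        d for d in docs
        if all(getattr(d, field) == value for field, value in filters.items())
    ]
    ranked = sorted(allowed, key=lambda d: cosine(query_vec, d.embedding), reverse=True)
    return [d.doc_id for d in ranked[:top_k]]


# Invented sample data, for illustration only.
docs = [
    Doc("us-advisor-fees", [0.9, 0.1, 0.0], "US", "advisor"),
    Doc("uk-advisor-fees", [1.0, 0.0, 0.0], "UK", "advisor"),
    Doc("us-advisor-onboarding", [0.1, 0.9, 0.0], "US", "advisor"),
]
query = [1.0, 0.0, 0.0]
results = hybrid_search(query, docs, {"jurisdiction": "US", "audience": "advisor"})
```

In this toy data the UK document is the nearest neighbor to the query, yet the jurisdiction filter keeps it out of a US advisor's results. That is the difference between governed retrieval and plain similarity matching.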
Regression Testing and Human Comparison
When models are upgraded, answers change. McMillan emphasized regression testing against golden question sets and comparing AI output to what experienced humans would say, not as a one-time bake-off but as an ongoing discipline. Edge cases, not happy paths, are where agents get you.
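A golden-set regression check can be sketched in a few lines. Everything here is an assumption made for illustration: the questions, the expert answers, the `candidate_model` stub, and especially the token-overlap score, which is a crude stand-in for whatever comparison method a team actually uses (human review, rubric grading, or a stronger metric).

```python
def token_overlap(answer, reference):
    """Fraction of reference tokens that also appear in the answer (crude stand-in metric)."""
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    return len(ref_tokens & ans_tokens) / len(ref_tokens) if ref_tokens else 1.0


def run_regression(answer_fn, golden_set, threshold=0.6):
    """Return (question, score) pairs for answers that score below the threshold."""
    failures = []
    for case in golden_set:
        score = token_overlap(answer_fn(case["question"]), case["expert_answer"])
        if score < threshold:
            failures.append((case["question"], round(score, 2)))
    return failures


# Hypothetical golden set: questions paired with what an experienced human would say.
golden_set = [
    {"question": "What is the review interval for policy documents?",
     "expert_answer": "policy documents are reviewed every 90 days"},
    {"question": "Who approves exceptions?",
     "expert_answer": "the domain owner approves exceptions"},
]


def candidate_model(question):
    """Stub standing in for the upgraded model under test."""
    canned = {
        "What is the review interval for policy documents?":
            "Policy documents are reviewed every 90 days",
        "Who approves exceptions?": "Any advisor can approve them",
    }
    return canned[question]


failures = run_regression(candidate_model, golden_set)
```

Running the same golden set after every model or prompt change turns "answers change" into a concrete list of questions to send back to the experts.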
They also discussed structured feedback loops: thumbs-down from real users, methodical prompt coaching for financial advisors, and the humility to admit when the machine is confidently wrong. Experts would rather serve clients than sit in evaluation sessions, but without that work, you are presiding over chaos you cannot see.
Prompt Coaching as Literacy
Retrieval quality depends on how people ask. Wu and McMillan treated prompt coaching as operational training, not a hackathon gimmick, so advisors learned to interrogate the system the way they would a junior analyst. Garbage in, garbage out survives every interface upgrade.
Why It Still Matters
Every theme in that hour—curation, ownership, hybrid retrieval, regression, human comparison—shows up again in McMillan’s 2026 return as agent-scale checklist items. The vocabulary shifted from digital assistant to fifteen thousand bots; the obligation to fund the bottom of the stack did not.