The visible layer is not the hard part.

In a recent return conversation on Invisible Machines, Jeff McMillan, founder of McMillanAI and former Head of Firmwide AI at Morgan Stanley, describes what executives and technologists actually want to discuss: applications, models, orchestration, the demo that ships before the quarter ends. That is where the hype lives. It is also, in his framing, the top of a stack that collapses without everything underneath it.

Layer one is information: accessible, high quality, and structured so AI can use it. Layer two is semantic: knowledge graphs, RAG, the connective tissue that turns documents into reasoning surfaces. Layer three is control: business rules, ethics, governance, monitoring. Only then come models, orchestration, and applications. McMillan has been saying variants of this for years. What changed is the stakes. A handful of agents can be brute-forced. Fifteen thousand cannot. At scale, a dataset that is ninety percent accessible is not nearly enough; you need something closer to full accessibility and quality in the high nineties, plus the instrumentation to know when agents drift.

Josh Tyson opens the episode with the parallel that makes the thesis land outside the data center: executives are doing the same thing with tools. They want to buy and bolt on before they have a source of truth. Agents, like leaders, need to be taught. Without canonical knowledge, both guess confidently.

The Tipping Point

Robb Wilson, CEO of OneReach.ai and co-host of the show, asks the question most knowledge-management programs fear: is there a way around this? McMillan’s answer is a tipping point, not a loophole. With five or ten or fifteen agents, you can fake the foundation. At fifteen thousand, you cannot. And the org-chart version of the problem is familiar: ten thousand employees carry tribal knowledge that senior leaders never had to formalize, because training the next person was good enough. That stops working when a system executes whatever you ask without the twenty years of judgment that tells a human “that looks dumb.”

That asymmetry drives the episode’s best warning about anthropomorphism. Humans stop. They conserve effort. They have other things to do. Large language models do not. Wilson’s book-collaboration metaphor lands cleanly: you will never come back to find that a human co-author has written two thousand pages unchecked; an LLM might produce a hundred thousand if nobody intervenes. Thinking of agents as colleagues is helpful for understanding them and dangerous if it leads you to expect human-like restraint. McMillan doubles down: if you want high-quality output, you must be explicit, iterate prompts dozens or hundreds of times, and test against golden sources because non-deterministic systems do weird things in production.

Custom Evals

Evaluation, he argues, is the missing organizational muscle. People like to build. They do not like to test. Experts have clients to serve, not four-hour evaluation blocks when McMillan comes calling. Wilson names what practitioners need: custom evals—organizational AGI, not generic benchmarks—kept stable so you know whether a model upgrade, prompt change, or new agent actually improved anything or silently regressed from eighty-nine percent to seventy-four.
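One way to picture a stable custom eval is a fixed set of cases with vetted answers, scored the same way before and after every change. The sketch below is illustrative only and is not something described in the episode: the names `EvalCase`, `pass_rate`, and `regressed`, the exact-match scoring, and the two-point tolerance are all assumptions, and a real suite would likely need richer grading than string comparison.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    """One hypothetical eval case: a prompt and its vetted 'golden' answer."""
    prompt: str
    expected: str


def pass_rate(answer_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases where the system's answer matches the golden answer.

    Exact match (case-insensitive, whitespace-trimmed) is an assumption made
    to keep the sketch short; it is not a recommended grading method.
    """
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(
        1
        for case in cases
        if answer_fn(case.prompt).strip().lower() == case.expected.strip().lower()
    )
    return passed / len(cases)


def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True if the candidate pass rate fell below baseline by more than tolerance."""
    return (baseline - candidate) > tolerance
```

Because the cases stay fixed, a drop such as the eighty-nine to seventy-four percent example shows up as `regressed(0.89, 0.74)` returning `True`, regardless of whether the cause was a model upgrade, a prompt change, or a new agent.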

The conversation turns to a failure mode Wilson sees in the field: CTOs “catching up on backlogs” with AI—shipping code for features nobody will use, burning tokens while revenue flatlines. McMillan is slightly gentler: controlled experimentation with training has learning value. But after six or nine months, the strategic questions should dominate. What are the five to ten things that matter at the business level, and how does AI enable them? Who destroys you in ten years, and what moats do you need?

He illustrates the dopamine trap personally: managing a farm with AI, four hours in, proud of the progress—then asking the system for more ideas and receiving thirty-seven of them. “Yeah—go ahead and implement them all.” Then fourteen hours debugging because scope exploded. Token consumption is easy to measure. Impact is not. Even perfect instrumentation on efficiency gains misses the second question: was freed capacity applied to something value-added, or did everyone go play golf?

Process Mapping Before Automation

Josh Tyson connects this back to knowledge management done seriously—not summarizing PDFs into a vector store and calling it done, but mapping processes end to end, surfacing implicit knowledge that might require leaving your desk to talk to someone. McMillan agrees in the broad sense and sharpens it: high-end knowledge businesses hire brilliant people and let tribal apprenticeship carry process understanding. Ask a senior leader to describe a core workflow with the specificity a consulting firm would demand for a process map, and most cannot. Without that clarity—and without standard KPIs—you cannot know whether AI output is better or worse than what you had.

Technologists and business leaders are at war over who owns the application layer. Robb Wilson’s gardening metaphor lands because it is precise—you can argue about who pushes the lawn mower while ignoring botany entirely. True transformation in a world of fifteen thousand bots is not an app-owner problem. It is a foundation problem.

Agent in the Loop

Governance is not a slide. McMillan walks through embedded controls, organizational ethics expressed in prompts (constitutions are not just for Anthropic), and monitoring layers—including independent models asking whether something smells wrong, the way one colleague reviews another’s work. Wilson flips “human in the loop” to agent in the loop: the agent surfaces missing information, seeks a human source, and should not institutionalize an answer until someone with accountability validates it. All information ultimately came from a person; the question is which person and whether anyone has something to lose if it is wrong.
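The agent-in-the-loop idea can be sketched as a gate between what an agent proposes and what the organization treats as canonical. This is a minimal illustration under stated assumptions, not a design from the episode: `Answer`, `KnowledgeBase`, and their fields are hypothetical names, and a real system would also need authentication, audit logs, and conflict handling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Answer:
    """A hypothetical piece of knowledge an agent wants to record."""
    question: str
    text: str
    source_person: Optional[str] = None  # who the information came from
    validated_by: Optional[str] = None   # accountable human who signed off


@dataclass
class KnowledgeBase:
    """Holds agent proposals apart from validated, canonical answers."""
    canonical: Dict[str, Answer] = field(default_factory=dict)
    pending: List[Answer] = field(default_factory=list)

    def propose(self, answer: Answer) -> None:
        """Agent surfaces an answer; it is not yet served as truth."""
        self.pending.append(answer)

    def validate(self, question: str, reviewer: str) -> Answer:
        """An accountable human promotes a pending answer to canonical."""
        for index, answer in enumerate(self.pending):
            if answer.question == question:
                answer.validated_by = reviewer
                self.canonical[question] = self.pending.pop(index)
                return answer
        raise KeyError(f"no pending answer for {question!r}")

    def lookup(self, question: str) -> Optional[Answer]:
        """Only validated answers are ever returned."""
        return self.canonical.get(question)
```

The design choice worth noting is that `lookup` never reads from `pending`, so an unreviewed answer cannot become institutional knowledge by accident, and every canonical entry carries the name of the person who validated it.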

The medical-chain thought experiment makes the stakes visceral: doctor, diagnostic agent, prescribing agent, agentic pharmacy, drone delivery—then the wrong pills and a death. Whose fault? MCP does not solve accountability. Agents do not have paychecks to lose. McMillan predicts a near future where professionals own and are responsible for agents the way they manage people today—because harm at scale exceeds what three executives could do in a day when one agent emails four million clients or fills the wrong prescription batch.

Use Case Zero

Wilson and McMillan converge on a design implication easy to miss in vendor decks: knowledge infrastructure is not only for agents. The humans approving agent output need the same accessible, high-quality context—especially when consequences are large and decisions must be fast. Use case zero in Wilson’s framing is not a Q&A box that says “ask me anything” after you train on a book; it is a system that learns, maintains knowledge, and teaches back—adapting to who you are, what you know, and how you learn, with feedback loops when people spot errors.

McMillan closes the philosophical thread where the episode earns its optimism. AI can make you dumb if you want to be dumb: lazy, deferring judgment, letting the model pick the donut. It can also make you smarter than any calendar of human advisors allows. He built a “Board of Advisors” on McMillanAI.com, with twenty perspectives from Roosevelt to Feynman to Mandela, because mixture-of-experts argument surfaces tradeoffs the way a good room does. Smart people with better information make better decisions. Generative AI amplifies that if you apply critical thinking. If you do not, it amplifies the opposite.

The lawn mower fight is the meme. The thesis is structural: you are not ready for fifteen thousand bots because you have not funded the bottom layers, have not mapped how work actually moves, have not built evals that survive model upgrades, and have not decided who loses their job when an agent goes wrong. Knowledge before agents is not nostalgia for enterprise content management. It is the admission that agentic scale turns tacit org memory into a production dependency—and production dependencies need owners, tests, and truth.
