From Prototype to Production: Building AI Systems That Don't Fall Over
The gap between a Jupyter notebook and a production AI system is enormous. What changes at scale, what breaks first, and the architecture decisions that prevent expensive rebuilds.
There is a familiar moment in AI projects. The prototype works. The notebook returns good answers. A demo gets shown to a stakeholder. The team commits to ship it. Three months later, the system is in production, on fire, and someone is asking what “p95 latency” means in a Slack channel at 2 AM.
The gap between a working AI prototype and a production AI system is not “more code.” It is a different category of system. This is what changes, what breaks, and how we architect around it.
Five things that change at scale
When the system goes from notebook to production, five problems show up roughly simultaneously:
- Latency budgets get real. A model that took two seconds on your laptop now needs to return in 300ms across a thousand concurrent requests.
- Cost becomes a function of design. Token usage, GPU hours, vector DB reads. Every architectural decision becomes a line item.
- Quality regressions become invisible. Without evals, you find out the model got worse from a customer email, not a metric.
- Failure modes multiply. Rate limits, model timeouts, prompt injection, schema drift in inputs. The notebook never hit any of them.
- The model is no longer the system. It’s a component inside a system that has retrieval, ranking, caching, guardrails, fallback logic, and observability.
The team that built the prototype is usually equipped for the first problem. The other four require different skills.
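To make the latency budget concrete, here is a minimal sketch of a p95 check. It assumes the nearest-rank definition of a percentile and reuses the 300ms figure from the list above; the function names and the sample latencies are illustrative, not measurements from a real system.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def within_budget(samples_ms, budget_ms=300, pct=95):
    """True if the chosen percentile is inside the latency budget."""
    return percentile(samples_ms, pct) <= budget_ms

# Hypothetical latencies: 90 fast requests and a slow tail of 10.
latencies_ms = [120] * 90 + [900] * 10

print(percentile(latencies_ms, 50))  # 120
print(percentile(latencies_ms, 95))  # 900
print(within_budget(latencies_ms))   # False
```

The median looks healthy here while the p95 is three times over budget, which is why averages and medians from a notebook say little about how a system behaves under load.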
The architecture in four layers
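A production AI system worth building has roughly four layers:
- Interface: chat, API, app surface, structured output.
- Orchestration: routing, retrieval, ranking, guardrails, fallback.
- Model: LLMs, embeddings, classifiers, often more than one.
- Eval and observability: traces, metrics, eval suites, on-call.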
The mistake is treating these as optional. The mistake’s cousin is treating them as sequential, building the interface first and adding observability later. Observability that goes in last is observability that never gets used.
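One way to picture the layers being built together is a request path where tracing exists from the first line of code. The sketch below is an assumption-heavy toy: `Trace`, `call_model`, `retrieve`, `orchestrate`, and `handle_request` are hypothetical names, and the model and retrieval steps are stubs standing in for real calls.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Observability layer: records one span per step of a request."""
    spans: list = field(default_factory=list)

    def record(self, name, started, **attrs):
        elapsed_ms = (time.perf_counter() - started) * 1000
        self.spans.append({"name": name, "ms": elapsed_ms, **attrs})

def call_model(prompt):
    """Model layer stand-in; a real system would call an LLM here."""
    return f"answer based on: {prompt}"

def retrieve(question):
    """Retrieval stand-in; returns hard-coded context."""
    return ["doc-1 text"]

def orchestrate(question, trace):
    """Orchestration layer: guardrail, retrieval, then the model call."""
    started = time.perf_counter()
    blocked = not question.strip()
    trace.record("guardrail", started, blocked=blocked)
    if blocked:
        return "Please ask a question."

    started = time.perf_counter()
    context = retrieve(question)
    trace.record("retrieve", started, n_docs=len(context))

    started = time.perf_counter()
    answer = call_model(f"{context[0]} | {question}")
    trace.record("model", started, chars=len(answer))
    return answer

def handle_request(payload):
    """Interface layer: reads the input and returns structured output."""
    trace = Trace()
    answer = orchestrate(str(payload.get("question", "")), trace)
    return {"answer": answer, "steps": [span["name"] for span in trace.spans]}

print(handle_request({"question": "What is p95 latency?"})["steps"])
# ['guardrail', 'retrieve', 'model']
```

Because every step writes a span, the question "where did the time go?" has an answer on the first day in production rather than after the first incident.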
The retrieval problem
RAG (retrieval-augmented generation) is the most common AI architecture in production right now and the one most often built wrong. The failure mode is consistent: a team indexes their documents into a vector database, wires it to an LLM, and discovers the system has a confident answer for every question and is wrong half the time.
The retrieval layer needs at least:
- Chunking that respects structure. Splitting on token count alone shreds meaning. Split on document structure (headings, sections) first.
- Hybrid retrieval. Vector search plus keyword search, with a re-ranker on top. Pure semantic search tends to miss exact terms such as product names, error codes, and identifiers, which keyword search catches.
- Evals against a real query set. Not synthetic. Real questions from real users.
This is where most “AI feels off” complaints actually originate. The model is fine. The retrieval is bringing it the wrong context.
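The first two requirements can be sketched in a few lines. This example assumes Markdown-style `#` headings as the document structure, and it merges the vector and keyword result lists with reciprocal rank fusion, which is one common fusion method chosen here for illustration rather than something the list above prescribes. A re-ranker would then rescore the top of the fused list. The function names and document ids are hypothetical.

```python
import re
from collections import defaultdict

def chunk_by_headings(markdown_text):
    """Split on Markdown headings so each chunk is one titled section."""
    chunks, title, lines = [], "preamble", []

    def flush():
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"title": title, "text": text})

    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each id scores the sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

doc = "# Setup\nInstall it.\n## Usage\nRun it.\n"
print([c["title"] for c in chunk_by_headings(doc)])  # ['Setup', 'Usage']

vector_hits = ["a", "b", "c"]   # hypothetical ids from vector search
keyword_hits = ["c", "a", "d"]  # hypothetical ids from keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists rise to the top, which is the point of hybrid retrieval. In practice, long sections would still need a size-based split inside each heading, and the fused list would be scored against the real query set.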
What breaks first
The order in which production AI systems tend to break:
- Model latency spikes under load (week 1)
- Cost surprises (week 2)
- Hallucinations on edge cases that no one tested (week 3)
- Quality drift as prompt or model versions change (month 2)
- Prompt injection or jailbreaks from real users probing the system (month 3)
If your roadmap doesn’t have at least an answer to each of these, plan for surprises.
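For quality drift in particular, one possible answer is a release gate that compares a candidate prompt or model against the current baseline on a fixed eval set. The sketch below makes several assumptions: substring matching as the scorer (crude, but easy to reason about), a 0.02 tolerance, and two placeholder cases that stand in for real user questions. All names are hypothetical.

```python
def run_evals(answer_fn, cases):
    """Return the fraction of cases whose answer contains the expected substring."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        1 for case in cases
        if case["expect"].lower() in answer_fn(case["question"]).lower()
    )
    return passed / len(cases)

def gate(candidate_score, baseline_score, tolerance=0.02):
    """Allow a release only if the candidate is within tolerance of the baseline."""
    return candidate_score >= baseline_score - tolerance

# Placeholder cases; a real set would come from logged user questions.
cases = [
    {"question": "What is the refund window?", "expect": "30 days"},
    {"question": "Which plan includes SSO?", "expect": "Enterprise"},
]

def baseline_system(question):
    """Stub for the system currently in production."""
    return {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }[question]

def candidate_system(question):
    """Stub for a new prompt version that regressed on one case."""
    return {
        "What is the refund window?": "Refunds are accepted within 14 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }[question]

baseline = run_evals(baseline_system, cases)
candidate = run_evals(candidate_system, cases)
print(baseline, candidate, gate(candidate, baseline))  # 1.0 0.5 False
```

Run in CI on every prompt or model change, a gate like this turns a month-2 surprise into a failed build.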
Tooling we actually recommend
There is a lot of AI infrastructure tooling. Most of it is fine. The choices that matter:
- Tracing: LangSmith, Helicone, Phoenix, or roll your own around OpenTelemetry. Pick one. Trace every request.
- Evals: PromptFoo or Inspect, plus a custom eval harness for your specific tasks.
- Vector DB: pgvector if you already run Postgres. Qdrant if you don’t. Pinecone if managed is non-negotiable.
- Model gateways: LiteLLM, OpenRouter, or similar. Don’t hardcode a single model provider unless you mean to.
What we keep avoiding: heavy “AI orchestration” frameworks that obscure what’s happening. The systems that age well are the ones a new engineer can read in an hour.
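To illustrate what not hardcoding a provider can look like without committing to any particular library, here is a small gateway sketch in plain Python. It does not use the LiteLLM or OpenRouter APIs; `Gateway`, `ProviderError`, and the two provider functions are hypothetical stand-ins, and each provider is assumed to be a callable that takes a prompt and returns text.

```python
from typing import Callable, Dict, List

class ProviderError(Exception):
    """Raised by a provider callable on a timeout, rate limit, or similar failure."""

Provider = Callable[[str], str]

class Gateway:
    """Tries providers in a configured order and returns the first success."""

    def __init__(self, providers: Dict[str, Provider], order: List[str]):
        self.providers = providers
        self.order = order

    def complete(self, prompt: str) -> dict:
        errors = []
        for name in self.order:
            try:
                text = self.providers[name](prompt)
                return {"provider": name, "text": text, "errors": errors}
            except ProviderError as exc:
                errors.append(f"{name}: {exc}")
        raise ProviderError("all providers failed: " + "; ".join(errors))

def primary(prompt: str) -> str:
    """Hypothetical provider that is currently rate limited."""
    raise ProviderError("rate limited")

def secondary(prompt: str) -> str:
    """Hypothetical fallback provider."""
    return f"echo: {prompt}"

gateway = Gateway({"primary": primary, "secondary": secondary}, order=["primary", "secondary"])
result = gateway.complete("hello")
print(result["provider"], result["errors"])  # secondary ['primary: rate limited']
```

The whole class fits on one screen, which is the property the paragraph above argues for: a new engineer can see exactly which provider answered and why the others did not.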
Where this connects to Hooman’s work
Production AI systems are the most common reason teams reach out to us at the moment. The work usually spans AI/ML architecture, DevOps and infrastructure, and product engineering. Recent examples include the Nosana platform and adjacent decentralized compute work.
If you have a prototype that works and a production deployment that doesn’t, the gap is usually a known set of decisions about architecture, evals, and observability. Schedule a call and we’ll walk through the specifics.
Key takeaways
- Five things change at scale simultaneously: latency budgets, cost as a function of design, invisible quality regressions, multiplying failure modes, and the model becoming a component, not the system.
- Build the four layers (interface, orchestration, model, eval+observability) together. Observability added last is observability that never gets used.
- RAG quality is usually a retrieval problem, not a model problem. Structure-aware chunking, hybrid retrieval, and re-ranking carry most of the lift.
- Production AI breaks in a predictable order: latency week 1, cost week 2, hallucinations week 3, quality drift month 2, prompt injection month 3.
- Avoid heavy AI orchestration frameworks that obscure what's happening. Systems that age well are readable by a new engineer in an hour.
Frequently asked
What changes when moving an AI prototype to production?
Five things happen roughly simultaneously: latency budgets become real (a model that took 2 seconds on a laptop needs 300ms across a thousand concurrent requests), cost becomes a function of design (token usage, GPU hours, vector DB reads turn into line items), quality regressions become invisible without evals, failure modes multiply (rate limits, timeouts, prompt injection, schema drift), and the model is no longer the system but a component inside one.
What does a production AI system architecture look like?
Four layers, built together rather than sequentially: an interface layer (chat, API, app surface, structured output), an orchestration layer (routing, retrieval, ranking, guardrails, fallback), a model layer (LLMs, embeddings, classifiers, often more than one), and an eval and observability layer (traces, metrics, eval suites, on-call). The mistake is treating any of these as optional or adding observability last.
Why do most RAG systems return wrong answers?
The retrieval is bringing the model the wrong context. The model is usually fine. Production RAG needs at least chunking that respects document structure (not just token count), hybrid retrieval combining vector search with keyword search and a re-ranker on top, and evaluation against a real query set, not synthetic questions. Most 'AI feels off' complaints originate in retrieval.
What AI infrastructure tools are worth standardizing on?
For tracing, pick one of LangSmith, Helicone, or Phoenix, or roll your own around OpenTelemetry, and trace every request. For evals, use PromptFoo or Inspect plus a custom harness. For a vector DB, use pgvector if you already run Postgres, Qdrant if not, and Pinecone if managed is non-negotiable. For model routing, use LiteLLM or OpenRouter, and don't hardcode a single provider. Avoid heavy orchestration frameworks that obscure what's happening.