Adding AI Features to a Web App: LLMs, RAG & Agents
- AI
- LLM
- RAG
- Node.js
- TypeScript
Adding AI to your web app is easier than it was two years ago and harder than the demos make it look. The OpenAI API or the Anthropic SDK gets you from zero to a working chat box in twenty minutes. Getting that same feature to be reliable, cost-controlled, and genuinely useful in production takes a lot more thought.
This guide covers the decisions that matter: which AI pattern to use, how to integrate it cleanly into a Node.js backend, how to keep costs predictable, and how to avoid the things that routinely go wrong in production AI features.
Start With the Right Question
Before touching any SDK, ask: what specific user problem does this AI feature solve, and why is AI the right solution?
The best AI features answer questions like:
- "Summarise this 40-page document for me."
- "Draft a reply to this customer email."
- "Find every invoice from last quarter that mentions project X."
- "Explain why this sensor reading is anomalous."
The worst AI features are solutions looking for a problem: chatbots that answer questions the UI already answers clearly, AI-generated content that users cannot trust and therefore do not read.
Three Patterns: LLM, RAG, and Agents
Most AI features in web apps fit one of three patterns. Choosing the right one determines most of your implementation complexity.
1. Direct LLM Call
Send a prompt, get a response. Right for: content generation, summarisation, classification, and extracting structured data from text you provide in the prompt.
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env
export async function summariseText(text: string): Promise<string> {
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 512,
    messages: [
      {
        role: 'user',
        content: `Summarise the following text in 3 bullet points. Be specific, not generic.\n\n<text>\n${text}\n</text>`,
      },
    ],
  });
  const block = message.content[0];
  if (block.type !== 'text') throw new Error('Unexpected response type');
  return block.text;
}
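This works well when the data you need to reason about fits in the context window, which runs to 100,000 tokens or more on many current models. The failure modes: if the data does not fit, the request fails or the input has to be truncated; and if the user asks about something that is neither in the prompt nor in the model's training data, the model tends to hallucinate an answer.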
2. Retrieval-Augmented Generation (RAG)
Retrieve relevant documents, inject them into the prompt, ask the model to answer based only on those documents. Right for: Q&A over private documents, knowledge bases, customer support with a large help centre, legal or compliance research.
The RAG pipeline:
- Ingest: chunk your documents, generate embeddings, store in a vector DB (Pinecone, pgvector, Supabase Vector)
- Retrieve: on each user query, embed the query and find the top-k most similar chunks
- Generate: send the retrieved chunks + user query to the LLM
import { OpenAI } from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';
const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
export async function ragQuery(
  userQuery: string,
  namespace: string
): Promise<string> {
  // 1. Embed the query
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });
  const queryVector = embeddingRes.data[0].embedding;

  // 2. Retrieve top-5 chunks
  const index = pinecone.index('docs').namespace(namespace);
  const results = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
  });
  const context = results.matches
    .map((m) => m.metadata?.text ?? '')
    .join('\n\n---\n\n');

  // 3. Generate with context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'Answer the user question using only the provided context. If the answer is not in the context, say so.',
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${userQuery}`,
      },
    ],
  });
  return response.choices[0].message.content ?? '';
}
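RAG is more complex than a direct LLM call, but it grounds the model in your data and substantially reduces the hallucination risk for questions that have a definite answer in your knowledge base. It does not eliminate that risk: retrieval can miss the relevant chunk, and the model can still misread the context it is given.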
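The ingest step is where much of the retrieval quality is decided, so it helps to see what "chunk your documents" can look like. The sketch below is one simple approach, fixed-size character chunks with overlap; the function name `chunkText` and the default sizes are illustrative assumptions, not part of any library, and many pipelines split on headings or sentences instead.

```typescript
// Hypothetical helper: split text into fixed-size chunks that overlap,
// so a sentence cut at a boundary still appears whole in one chunk.
export function chunkText(
  text: string,
  chunkSize = 800, // assumed default, tune for your documents
  overlap = 100 // assumed default
): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error('overlap must be >= 0 and smaller than chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk would then be embedded and stored alongside its text, so that the `metadata.text` field read at query time has something to return.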
3. AI Agents
An agent can take actions — call APIs, query databases, run code, send messages — not just generate text. Right for: automating workflows, research tasks with multiple steps, anything where the model needs to figure out what to do next.
Agents are the most powerful and the most brittle. They require:
- Well-defined tools with clear schemas
- Guardrails to prevent runaway loops
- Logging so you can debug what happened
- Human-in-the-loop steps for high-stakes actions
Start with direct LLM calls. Add RAG when your data does not fit in context. Add agents only when the workflow genuinely requires multi-step decision-making.
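The four agent requirements above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a production framework: every name here (`runAgent`, `Tool`, `decide`, `approve`) is hypothetical, and `decide` stands in for whatever model call picks the next step.

```typescript
// All names are hypothetical; `decide` stands in for a real model call.
type ToolCall = { tool: string; input: string };
type Decision =
  | { type: 'final'; answer: string }
  | { type: 'tool'; call: ToolCall };

interface Tool {
  description: string;
  requiresApproval: boolean; // high-stakes tools need a human sign-off
  run: (input: string) => Promise<string>;
}

interface AgentOptions {
  tools: Map<string, Tool>;
  decide: (history: string[]) => Promise<Decision>;
  approve: (call: ToolCall) => Promise<boolean>;
  maxSteps?: number;
}

export async function runAgent(task: string, opts: AgentOptions): Promise<string> {
  const { tools, decide, approve } = opts;
  const maxSteps = opts.maxSteps ?? 5;
  // The history is both the model's working memory and a debug log.
  const history: string[] = [`task: ${task}`];

  // Guardrail: a hard step limit prevents runaway loops.
  for (let step = 0; step < maxSteps; step++) {
    const decision = await decide(history);
    if (decision.type === 'final') return decision.answer;

    const call = decision.call;
    const tool = tools.get(call.tool);
    if (!tool) {
      // Unknown tool: report it back to the model instead of crashing.
      history.push(`error: unknown tool "${call.tool}"`);
      continue;
    }
    // Human-in-the-loop: ask before running anything marked high-stakes.
    if (tool.requiresApproval && !(await approve(call))) {
      history.push(`denied: ${call.tool}`);
      continue;
    }
    const result = await tool.run(call.input);
    history.push(`tool ${call.tool}(${call.input}) -> ${result}`);
  }
  throw new Error(`Agent stopped after ${maxSteps} steps without a final answer`);
}
```

Keeping the step limit and the approval check in the loop, rather than in the prompt, means they hold even when the model ignores its instructions.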
Structuring Prompts for Production
Prompt engineering matters more than most developers expect. A few principles:
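Use system prompts to set behaviour, not just context. The system message defines the model's role, constraints, and output format. Models generally give it more weight than user messages, but it is not a security boundary: prompt injection can still override it, so enforce hard limits in code rather than in the prompt.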
Be explicit about format. Ask for JSON, bullet points, or a specific structure and validate the output.
const CLASSIFICATION_SYSTEM = `You classify customer support tickets into one of these categories:
billing, technical, feature-request, or other.
Respond with valid JSON only, in this exact shape:
{"category": "billing", "confidence": 0.92, "reasoning": "..."}
Do not include any text outside the JSON.`;
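Asking for JSON is only half of the principle; the other half is validating what comes back. The sketch below parses a response to the classification prompt above and rejects anything that does not match the requested shape. The names `parseClassification` and `Classification` are illustrative, and the assumption that `confidence` lies between 0 and 1 is inferred from the example value in the prompt.

```typescript
const CATEGORIES = ['billing', 'technical', 'feature-request', 'other'] as const;
type Category = (typeof CATEGORIES)[number];

export interface Classification {
  category: Category;
  confidence: number;
  reasoning: string;
}

// Hypothetical validator: returns null for anything that is not the exact shape.
export function parseClassification(raw: string): Classification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (e) {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const record = data as Record<string, unknown>;
  const category = record.category;
  const confidence = record.confidence;
  const reasoning = record.reasoning;

  const known = (CATEGORIES as readonly string[]).indexOf(String(category)) !== -1;
  if (typeof category !== 'string' || !known) return null;
  // Assumption: confidence is a probability in [0, 1].
  if (typeof confidence !== 'number' || confidence < 0 || confidence > 1) return null;
  if (typeof reasoning !== 'string') return null;

  return { category: category as Category, confidence, reasoning };
}
```

A `null` result is the signal to retry the call or route the ticket to a default queue, rather than passing unchecked model output downstream.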
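Put instructions before data. Models tend to attend to the start and end of the context more reliably than the middle, so an instruction buried after a long input is easy to lose. "Summarise the following text: ..." usually works better than "... please summarise this." For very long inputs, it can also help to repeat the instruction after the data; test both orderings on your own prompts.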
Streaming for Better UX
Streaming is one of the highest-leverage UX improvements for any AI feature. Instead of the user waiting 10 seconds for the full response to appear, it starts appearing in under a second.
In a Next.js API route:
// app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk';
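export async function POST(req: Request) {
  const { message } = await req.json();
  // Reject malformed requests before spending tokens on them.
  if (typeof message !== 'string' || message.length === 0) {
    return new Response('Missing "message"', { status: 400 });
  }

  const client = new Anthropic();
  const stream = client.messages.stream({
    model: 'claude-haiku-4-5',
    max_tokens: 1024,
    messages: [{ role: 'user', content: message }],
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        // Forward only text deltas; other event types carry metadata.
        for await (const chunk of stream) {
          if (
            chunk.type === 'content_block_delta' &&
            chunk.delta.type === 'text_delta'
          ) {
            controller.enqueue(encoder.encode(chunk.delta.text));
          }
        }
        controller.close();
      } catch (err) {
        // Surface upstream failures instead of leaving the response hanging.
        controller.error(err);
      }
    },
    cancel() {
      // Stop generating (and paying for) tokens if the client disconnects.
      stream.abort();
    },
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}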
On the client, read the stream with a ReadableStream reader and append chunks to state as they arrive.
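As a rough sketch of that client side, the helper below drains a response body and hands each decoded piece of text to a callback. `readTextStream` and `onChunk` are hypothetical names; the stream and decoder calls are the standard Web Streams and `TextDecoder` APIs.

```typescript
// Hypothetical helper: read a byte stream as text, chunk by chunk.
export async function readTextStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (text: string) => void
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let full = '';
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunk boundaries.
      const text = decoder.decode(value, { stream: true });
      if (text) {
        full += text;
        onChunk(text);
      }
    }
    // Flush any bytes the decoder was still holding.
    const tail = decoder.decode();
    if (tail) {
      full += tail;
      onChunk(tail);
    }
  } finally {
    reader.releaseLock();
  }
  return full;
}
```

In a React component this might be called as `readTextStream(res.body!, (t) => setAnswer((prev) => prev + t))` after a `fetch` to `/api/chat`, where `setAnswer` is an assumed state setter.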
Controlling Costs in Production
AI costs scale with usage in ways that are easy to underestimate. At $0.002 per call and one call per user per day, a feature costs $0.20/day at 100 users and $200/day at 100,000 users.
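A back-of-the-envelope estimator makes this arithmetic easy to rerun as prompts and traffic change. The sketch below is illustrative only: `costPerCall` and `dailyCost` are hypothetical helpers, and the prices are parameters you supply from your provider's current price list, not real quotes.

```typescript
// Prices are inputs, not real quotes: check your provider's current pricing.
export interface CallProfile {
  inputTokens: number; // average prompt size
  outputTokens: number; // average response size
  inputPricePerMTok: number; // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
}

export function costPerCall(p: CallProfile): number {
  const inputCost = p.inputTokens * p.inputPricePerMTok;
  const outputCost = p.outputTokens * p.outputPricePerMTok;
  return (inputCost + outputCost) / 1_000_000;
}

export function dailyCost(p: CallProfile, callsPerDay: number): number {
  return costPerCall(p) * callsPerDay;
}
```

With made-up numbers of 1,000 input tokens at $1 per million and 200 output tokens at $5 per million, a call works out to about $0.002, which is the per-call figure used above.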
Levers to control cost:
- Choose the right model for the task. Use a small fast model (Claude Haiku, GPT-4o mini) for classification and short generation. Use a large model only when the task genuinely needs it.
- Cache repeated queries. If 40% of your users ask the same three questions, cache those responses in Redis.
- Set max_tokens explicitly. Never let the model run to its default maximum.
- Budget by user tier. Free users get 10 AI queries/day. Pro users get 500. Track usage in your database and gate at the API layer.
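// middleware: check AI budget before calling the model
// Assumes a connected `redis` client and a `User` model are imported from your own modules.
export async function checkAiBudget(userId: string): Promise<void> {
  const today = new Date().toISOString().slice(0, 10); // UTC calendar day
  const key = `ai:usage:${userId}:${today}`;
  // INCR is atomic, so concurrent requests cannot both slip under the limit.
  const count = await redis.incr(key);
  // The key is already per-day; the TTL only cleans up old counters.
  await redis.expire(key, 86_400);
  const user = await User.findById(userId).select('plan');
  const limit = user?.plan === 'pro' ? 500 : 10;
  if (count > limit) {
    throw new Error('Daily AI usage limit reached');
  }
}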
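The "cache repeated queries" lever can be sketched in a similar way. The example below uses an in-memory map so that it stays self-contained; in production the same key scheme would more likely sit in Redis, as suggested above. `ResponseCache` and `normaliseQuery` are hypothetical names, and normalising by case and whitespace is only one possible definition of "the same question".

```typescript
// Hypothetical normaliser: treat queries that differ only in case or spacing as equal.
export function normaliseQuery(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, ' ');
}

// Hypothetical in-memory cache with a time-to-live per entry.
export class ResponseCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, mainly for tests
  }

  get(query: string): string | undefined {
    const key = normaliseQuery(query);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired: drop it so stale answers are never served.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(query: string, value: string): void {
    this.entries.set(normaliseQuery(query), {
      value,
      expiresAt: this.now() + this.ttlMs,
    });
  }
}
```

A caller would check `cache.get(query)` first and only call the model, then `cache.set(query, answer)`, on a miss. Caching like this suits questions whose answer does not depend on who is asking; per-user answers need the user ID in the key.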
Handling Hallucinations
LLMs generate confident-sounding incorrect statements. This is not a bug you can patch; it is a property of how the models work. Design your feature with this in mind:
- Do not use LLMs for factual lookups without RAG. "What is our refund policy?" must be answered from your actual policy document, not from the model's training data.
- Show sources. When using RAG, cite the document chunks the answer came from. Users can verify, and trust increases.
- Use structured outputs and validation. If the model is supposed to return JSON, validate the schema. Retry with a clarifying prompt if the output is malformed.
- Add human review for high-stakes actions. An agent that drafts an email is useful. An agent that sends the email unsupervised is a liability.
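The "show sources" point can be made concrete with a small amount of code. This sketch assumes the prompt instructs the model to cite sources with bracketed numbers such as `[1]`; the `SourceChunk` shape and both function names are hypothetical, and a model is not guaranteed to follow the citation format, which is why unknown numbers are dropped.

```typescript
// Hypothetical shape for a retrieved chunk.
export interface SourceChunk {
  id: string;
  title: string;
  text: string;
}

// Number each chunk so the model can refer to it as [1], [2], ...
export function buildCitedContext(chunks: SourceChunk[]): string {
  return chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.title}\n${chunk.text}`)
    .join('\n\n');
}

// Map the [n] markers in an answer back to chunks, ignoring numbers
// that do not correspond to any chunk that was actually provided.
export function extractCitations(answer: string, chunks: SourceChunk[]): SourceChunk[] {
  const cited: SourceChunk[] = [];
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const chunk = chunks[Number(match[1]) - 1];
    if (chunk && cited.indexOf(chunk) === -1) cited.push(chunk);
  }
  return cited;
}
```

The UI can then render the returned chunks as links under the answer, and an answer with no valid citations can be flagged or withheld.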
Adding AI features to your web app is a product decision as much as an engineering one. If you want experienced engineers to design and ship the integration, see our AI/ML integration service or contact us to discuss your specific use case.