Adding AI Features to a Web App: LLMs, RAG & Agents

By Muhammad Hussain · 20 January 2026 · Updated 5 March 2026 · 10 min read

AI
LLM
RAG
Node.js
TypeScript

Adding AI to your web app is easier than it was two years ago and harder than the demos make it look. The OpenAI API or the Anthropic SDK gets you from zero to a working chat box in twenty minutes. Getting that same feature to be reliable, cost-controlled, and genuinely useful in production takes a lot more thought.

This guide covers the decisions that matter: which AI pattern to use, how to integrate it cleanly into a Node.js backend, how to keep costs predictable, and how to avoid the things that routinely go wrong in production AI features.

Start With the Right Question

Before touching any SDK, ask: what specific user problem does this AI feature solve, and why is AI the right solution?

The best AI features answer questions like:

"Summarise this 40-page document for me."
"Draft a reply to this customer email."
"Find every invoice from last quarter that mentions project X."
"Explain why this sensor reading is anomalous."

The worst AI features are solutions looking for a problem: chatbots that answer questions the UI already answers clearly, or AI-generated content that users cannot trust and therefore do not read.

Three Patterns: LLM, RAG, and Agents

Most AI features in web apps fit one of three patterns. Choosing the right one determines most of your implementation complexity.

1. Direct LLM Call

Send a prompt, get a response. Right for: content generation, summarisation, classification, extraction from structured data you provide in the prompt.

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

export async function summariseText(text: string): Promise<string> {
  const message = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 512,
    messages: [
      {
        role: 'user',
        content: `Summarise the following text in 3 bullet points. Be specific, not generic.\n\n<text>\n${text}\n</text>`,
      },
    ],
  });

  const block = message.content[0];
  if (block.type !== 'text') throw new Error('Unexpected response type');
  return block.text;
}

This works well when the data you need to reason about fits in the model's context window, which on current models is typically 100,000 tokens or more. The failure mode: if the data does not fit, or if the user asks about facts that are neither in the prompt nor in the model's training data, the model may hallucinate an answer.

2. Retrieval-Augmented Generation (RAG)

Retrieve relevant documents, inject them into the prompt, ask the model to answer based only on those documents. Right for: Q&A over private documents, knowledge bases, customer support with a large help centre, legal or compliance research.

The RAG pipeline:

Ingest: chunk your documents, generate embeddings, store in a vector DB (Pinecone, pgvector, Supabase Vector)
Retrieve: on each user query, embed the query and find the top-k most similar chunks
Generate: send the retrieved chunks + user query to the LLM

import { OpenAI } from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

export async function ragQuery(
  userQuery: string,
  namespace: string
): Promise<string> {
  // 1. Embed the query
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });
  const queryVector = embeddingRes.data[0].embedding;

  // 2. Retrieve top-5 chunks
  const index = pinecone.index('docs').namespace(namespace);
  const results = await index.query({
    vector: queryVector,
    topK: 5,
    includeMetadata: true,
  });

  const context = results.matches
    .map((m) => m.metadata?.text ?? '')
    .join('\n\n---\n\n');

  // 3. Generate with context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'Answer the user question using only the provided context. If the answer is not in the context, say so.',
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${userQuery}`,
      },
    ],
  });

  return response.choices[0].message.content ?? '';
}

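RAG is more complex than a direct LLM call, but it grounds the model in your data and greatly reduces the hallucination risk for questions that have a definite answer in your knowledge base. It does not remove the risk entirely: if retrieval returns the wrong chunks, the model can still answer incorrectly.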
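The query code above assumes the ingest step has already run. As a minimal sketch of the chunking part of ingest, assuming fixed-size character windows with overlap (the `chunkText` name and its default sizes are illustrative, not a recommendation from any library), something like this would do:

```typescript
// Hypothetical helper: split text into overlapping, fixed-size character chunks.
// The overlap keeps sentences that straddle a boundary visible in both chunks.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error('overlap must be >= 0 and smaller than chunkSize');
  }

  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk would then be embedded and stored alongside its text as metadata. Real pipelines often split on paragraph or heading boundaries instead of raw character counts; this version only shows the windowing idea.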

3. AI Agents

An agent can take actions — call APIs, query databases, run code, send messages — not just generate text. Right for: automating workflows, research tasks with multiple steps, anything where the model needs to figure out what to do next.

Agents are the most powerful and the most brittle. They require:

Well-defined tools with clear schemas
Guardrails to prevent runaway loops
Logging so you can debug what happened
Human-in-the-loop steps for high-stakes actions
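The four requirements above can be sketched in one small loop. This is an illustration, not a framework: `Tool`, `AgentStep`, `Decide`, and `runAgent` are all hypothetical names, and `decide` stands in for whatever model call picks the next step.

```typescript
// All names below are hypothetical and exist only for this sketch.
type Tool = {
  description: string;
  requiresApproval: boolean; // high-stakes tools need a human sign-off
  run: (input: string) => Promise<string>;
};

type AgentStep =
  | { kind: 'tool'; tool: string; input: string }
  | { kind: 'final'; answer: string };

// Stand-in for the model call: given the transcript so far, choose the next step.
type Decide = (transcript: string[]) => Promise<AgentStep>;

async function runAgent(
  task: string,
  tools: Map<string, Tool>,
  decide: Decide,
  approve: (tool: string, input: string) => Promise<boolean>,
  maxSteps = 5, // guardrail against runaway loops
): Promise<{ answer: string; transcript: string[] }> {
  // The transcript doubles as a log for debugging what happened
  const transcript: string[] = [`task: ${task}`];

  for (let step = 0; step < maxSteps; step++) {
    const next = await decide(transcript);
    if (next.kind === 'final') {
      transcript.push(`final: ${next.answer}`);
      return { answer: next.answer, transcript };
    }

    const tool = tools.get(next.tool);
    if (!tool) {
      transcript.push(`error: unknown tool ${next.tool}`);
      continue;
    }
    // Human-in-the-loop: skip the action unless someone approves it
    if (tool.requiresApproval && !(await approve(next.tool, next.input))) {
      transcript.push(`denied: ${next.tool}`);
      continue;
    }

    const output = await tool.run(next.input);
    transcript.push(`tool ${next.tool}(${next.input}) -> ${output}`);
  }

  throw new Error(`Agent stopped after ${maxSteps} steps without a final answer`);
}
```

The step cap and the approval callback are the two pieces worth keeping even if everything else is replaced by an SDK's own tool-use loop.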

Start with direct LLM calls. Add RAG when your data does not fit in context. Add agents only when the workflow genuinely requires multi-step decision-making.

Structuring Prompts for Production

Prompt engineering matters more than most developers expect. A few principles:

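Use system prompts to set behaviour, not just context. The system message defines the model's role, constraints, and output format. Models give it priority over user messages, but it is not a security boundary: prompt injection in user input can still steer the output, so enforce anything that matters in code.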

Be explicit about format. Ask for JSON, bullet points, or a specific structure and validate the output.

const CLASSIFICATION_SYSTEM = `You classify customer support tickets into one of these categories:
billing, technical, feature-request, or other.

Respond with valid JSON only, in this exact shape:
{"category": "billing", "confidence": 0.92, "reasoning": "..."}

Do not include any text outside the JSON.`;

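Put instructions before data. Models tend to use the start and end of the context more reliably than the middle, so an instruction buried inside a long block of data is easy to lose. "Summarise the following text: ..." is a safer default than "... — please summarise this." For very long inputs, repeating the instruction after the data can also help, so test both orderings on your own prompts.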
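For the classification prompt above, the "validate the output" half might look like the sketch below. `parseClassification` is a hypothetical helper, and the 0 to 1 range for `confidence` is an assumption based on the example value in the prompt.

```typescript
// Hypothetical validator for the JSON shape requested in the system prompt.
const CATEGORIES = ['billing', 'technical', 'feature-request', 'other'] as const;
type Category = (typeof CATEGORIES)[number];

interface Classification {
  category: Category;
  confidence: number;
  reasoning: string;
}

function parseClassification(raw: string): Classification {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error('Model output is not valid JSON');
  }
  if (typeof data !== 'object' || data === null) {
    throw new Error('Model output is not a JSON object');
  }

  const { category, confidence, reasoning } = data as Record<string, unknown>;
  if (typeof category !== 'string' || !CATEGORIES.includes(category as Category)) {
    throw new Error(`Unknown category: ${String(category)}`);
  }
  // Assumption: confidence is a probability-like score between 0 and 1
  if (typeof confidence !== 'number' || confidence < 0 || confidence > 1) {
    throw new Error('confidence must be a number between 0 and 1');
  }
  if (typeof reasoning !== 'string') {
    throw new Error('reasoning must be a string');
  }
  return { category: category as Category, confidence, reasoning };
}
```

A schema library would shorten this, but hand-written checks keep the example dependency-free and make each failure message specific enough to feed back into a retry prompt.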

Streaming for Better UX

Streaming is one of the highest-leverage UX improvements for any AI feature. Instead of the user waiting 10 seconds for the full response to appear, it starts appearing in under a second.

In a Next.js API route:

// app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk';

export async function POST(req: Request) {
  const { message } = await req.json();
  const client = new Anthropic();

  const stream = client.messages.stream({
    model: 'claude-haiku-4-5',
    max_tokens: 1024,
    messages: [{ role: 'user', content: message }],
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        if (
          chunk.type === 'content_block_delta' &&
          chunk.delta.type === 'text_delta'
        ) {
          controller.enqueue(encoder.encode(chunk.delta.text));
        }
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

On the client, read the stream with a ReadableStream reader and append chunks to state as they arrive.
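A minimal sketch of that client-side reader, assuming the plain-text stream returned by the route above. `readTextStream` is a hypothetical helper name; `getReader`, `read`, and `TextDecoder` are standard web APIs.

```typescript
// Hypothetical client helper: read a streamed text body chunk by chunk.
async function readTextStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (text: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let full = '';

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact when they are split across chunks
      const text = decoder.decode(value, { stream: true });
      if (text) {
        full += text;
        onChunk(text);
      }
    }
    // Flush any bytes the decoder was still holding
    const tail = decoder.decode();
    if (tail) {
      full += tail;
      onChunk(tail);
    }
  } finally {
    reader.releaseLock();
  }
  return full;
}

// Usage sketch (setAnswer is a hypothetical React state setter):
// const res = await fetch('/api/chat', { method: 'POST', body: JSON.stringify({ message }) });
// if (res.body) await readTextStream(res.body, (t) => setAnswer((prev) => prev + t));
```

Decoding with `stream: true` matters because a network chunk can end in the middle of a multi-byte UTF-8 character; decoding each chunk independently would render replacement characters.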

Controlling Costs in Production

AI costs scale with usage in ways that are easy to underestimate. A feature that costs $0.002 per call costs $0.20/day at 100 users/day (one call each) and $200/day at 100,000 users/day.
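The arithmetic is worth putting into a small estimator before launch. The sketch below is illustrative: `ModelPricing`, `costPerCall`, and `dailyCost` are hypothetical names, and the prices in the usage comment are made up so that the numbers land on the $0.002 example. Check your provider's current price list for real figures.

```typescript
// Hypothetical pricing shape: USD per million tokens.
interface ModelPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of one call, given token counts and per-million-token prices
function costPerCall(
  inputTokens: number,
  outputTokens: number,
  pricing: ModelPricing,
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMillion +
    (outputTokens / 1_000_000) * pricing.outputPerMillion
  );
}

// Daily spend for a given number of users and calls per user
function dailyCost(perCall: number, users: number, callsPerUser = 1): number {
  return perCall * users * callsPerUser;
}

// Usage sketch with made-up prices ($1 input, $5 output per million tokens):
// const perCall = costPerCall(1_000, 200, { inputPerMillion: 1, outputPerMillion: 5 }); // about $0.002
// dailyCost(perCall, 100_000); // about $200/day
```

Note that output tokens are usually priced higher than input tokens, which is one reason capping `max_tokens` has a direct effect on the bill.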

Levers to control cost:

Choose the right model for the task. Use a small fast model (Claude Haiku, GPT-4o mini) for classification and short generation. Use a large model only when the task genuinely needs it.
Cache repeated queries. If 40% of your users ask the same three questions, cache those responses in Redis.
Set max_tokens explicitly. Never let the model run to its default maximum.
Budget by user tier. Free users get 10 AI queries/day. Pro users get 500. Track usage in your database and gate at the API layer.

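// middleware: check AI budget before calling the model
// Assumes `redis` (a connected Redis client) and `User` (a Mongoose-style model)
// are imported from elsewhere in your app.
export async function checkAiBudget(userId: string): Promise<void> {
  // Daily counter keyed by UTC date, e.g. ai:usage:<userId>:2026-01-20
  const today = new Date().toISOString().slice(0, 10);
  const key = `ai:usage:${userId}:${today}`;

  // INCR is atomic, so concurrent requests cannot both slip under the limit
  const count = await redis.incr(key);
  // Let the counter clean itself up after 24 hours
  await redis.expire(key, 86_400);

  const user = await User.findById(userId).select('plan');
  const limit = user?.plan === 'pro' ? 500 : 10;

  if (count > limit) {
    throw new Error('Daily AI usage limit reached');
  }
}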
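The caching lever from the list above can be sketched in a similar spirit. This version uses an in-memory `Map` so that it runs on its own; in production a shared store such as Redis would take its place. `withCache` and `Generate` are hypothetical names, and the lower-casing and whitespace normalisation is an assumption about what counts as "the same question".

```typescript
// Hypothetical: Generate stands in for whichever model call the feature makes.
type Generate = (prompt: string) => Promise<string>;

interface CacheEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

function withCache(
  generate: Generate,
  ttlMs = 60 * 60 * 1000,
  now: () => number = Date.now, // injectable clock, mainly for tests
): Generate {
  const cache = new Map<string, CacheEntry>();

  return async (prompt) => {
    // Normalise so trivially different phrasings share one entry
    const key = prompt.trim().toLowerCase().replace(/\s+/g, ' ');

    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;

    const value = await generate(prompt);
    cache.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

Exact-match caching only pays off for prompts that really do repeat, such as canned help questions. Anything that includes per-user data in the prompt needs the user or tenant in the cache key, or it should not be cached at all.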

Handling Hallucinations

LLMs generate confident-sounding incorrect statements. This is not a bug you can patch; it is a property of how the models work. Design your feature with this in mind:

Do not use LLMs for factual lookups without RAG. "What is our refund policy?" must be answered from your actual policy document, not from the model's training data.
Show sources. When using RAG, cite the document chunks the answer came from. Users can verify, and trust increases.
Use structured outputs and validation. If the model is supposed to return JSON, validate the schema. Retry with a clarifying prompt if the output is malformed.
Add human review for high-stakes actions. An agent that drafts an email is useful. An agent that sends the email unsupervised is a liability.
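The "validate, then retry with a clarifying prompt" advice above can be sketched as a small wrapper. `generateValidated` and `CallModel` are hypothetical names; `callModel` stands in for whichever SDK call the feature uses, and the wording of the retry message is only an example.

```typescript
// Hypothetical: callModel wraps the actual SDK call and returns raw text.
type CallModel = (prompt: string) => Promise<string>;

async function generateValidated<T>(
  callModel: CallModel,
  prompt: string,
  validate: (raw: string) => T, // should throw on malformed output
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  let lastError = 'unknown error';

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(currentPrompt);
    try {
      return validate(raw);
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // Clarifying retry: tell the model what was wrong with its last output
      currentPrompt =
        `${prompt}\n\nYour previous reply was rejected: ${lastError}. ` +
        'Reply again with valid output only.';
    }
  }

  throw new Error(`No valid output after ${maxAttempts} attempts: ${lastError}`);
}
```

Capping the attempts matters for cost as well as latency: every retry is another billed call, so a low limit plus a visible failure is usually better than retrying until something parses.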

Adding AI features to your web app is a product decision as much as an engineering one. If you want experienced engineers to design and ship the integration, see our AI/ML integration service or contact us to discuss your specific use case.
