RAG Chunking Strategies That Actually Work in Production

Why your RAG pipeline fails before retrieval even starts

You deployed embeddings, wired up Pinecone or pgvector, and shipped a chat endpoint. Users still get answers that cite the wrong paragraph, hallucinate policy details, or confidently summarize a section that never made it into the index. When that happens, teams reach for a bigger model or tweak the system prompt. The cheaper fix is almost always upstream: how you cut source material into chunks before it ever hits the vector store.

Chunking is not a preprocessing checkbox. It is the contract between your knowledge base and your retriever. A chunk that is too large dilutes the embedding signal. A chunk that splits mid-sentence severs causality. A chunk with no metadata cannot be filtered when the user asks about "last quarter's pricing" versus "enterprise SSO." Production RAG chunking strategies matter because retrieval quality has a ceiling set by ingest decisions you made weeks ago.

This guide focuses on patterns that survive real Node.js backends: TypeScript ingestion workers, batch embedding calls, upserts into vector databases, and reindex jobs when docs change. No toy notebooks. The goal is chunk boundaries you can explain to a teammate and measure with a simple eval set.

What chunking actually optimizes for

Embeddings map text into a vector space where semantic similarity approximates relevance. Chunking decides which sentences share that vector. You are optimizing for three tensions at once.

  • Precision: the retrieved chunk should contain the answer, not three adjacent topics.
  • Context: the chunk should include enough surrounding detail that the LLM can answer without guessing.
  • Recall: the right chunk must land in the top-k results when users phrase questions differently than the source doc.

Think of chunking like indexing a textbook. A good index entry points to one idea with enough page context to orient the reader. A bad index entry lists an entire chapter under one heading. RAG chunking strategies differ by content type because the "one idea" boundary moves: API reference pages break on endpoint headings, runbooks break on numbered steps, legal PDFs break on section numbers.

Your embedding model has a token limit, but that is not your target chunk size. Models accept long inputs; retrievers work best when each vector represents a coherent unit of meaning roughly between 200 and 800 tokens for most technical documentation. Start there, then tune with evals rather than folklore.

Fixed-size chunking: the baseline you should measure against

Fixed-size chunking splits text every N characters or tokens, optionally with overlap. It is the baseline every team should implement first—not because it wins, but because it is fast, deterministic, and gives you a number to beat.

In Node.js, token-aware splitting beats raw character counts when your sources mix code and prose. Approximate with gpt-tokenizer or the tokenizer your embedding provider documents. A typical starting point for internal docs:

const CHUNK_TOKENS = 512;
const OVERLAP_TOKENS = 64;

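// encode and decode come from your tokenizer (gpt-tokenizer exports both, for example).
function chunkByTokens(
  text: string,
  encode: (s: string) => number[],
  decode: (ids: number[]) => string
): string[] {
  const ids = encode(text);
  const chunks: string[] = [];
  let start = 0;
  while (start < ids.length) {
    const end = Math.min(start + CHUNK_TOKENS, ids.length);
    chunks.push(decode(ids.slice(start, end)));
    if (end === ids.length) break;
    // Step forward by the stride (CHUNK_TOKENS - OVERLAP_TOKENS).
    start = end - OVERLAP_TOKENS;
  }
  return chunks;
}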

Fixed-size chunking shines on homogeneous text: support macros, chat transcripts, plain release notes. It fails loudly on structured docs—which is useful. When your baseline recall is poor, you know structure-aware splitting is worth the engineering time.

| Approach | Best for | Main risk |
| --- | --- | --- |
| Fixed character split | Quick prototypes | Cuts words and URLs mid-token |
| Fixed token split | Uniform prose | Ignores headings and lists |
| Sentence-bounded token split | Articles, blogs | Long sentences in legal or spec text |

Ship the baseline to staging, run fifty real user questions against it, and save the metrics. Every fancier strategy gets compared to this line.

Structure-aware chunking for docs, code, and APIs

Most production knowledge bases are not flat strings. They are Markdown wikis, OpenAPI YAML, HTML help centers, and generated TypeScript doc comments. Structure-aware chunking uses document semantics—headings, fences, list depth—to place boundaries where human authors already signaled topic changes.

The pattern is consistent: parse into an AST or token stream, walk nodes, accumulate text until a boundary event, emit a chunk, carry optional overlap from the previous section's trailing sentences. Libraries like remark for Markdown or cheerio for HTML fit naturally into a Node.js ingest worker.

Respect heading boundaries in Markdown and HTML

A practical rule: never merge content across a heading of equal or higher level. If you are inside an h2 section, subsections (h3, h4) may concatenate until you hit the token budget, then split on paragraph boundaries.

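import { unified } from "unified";
import remarkParse from "remark-parse";
import remarkGfm from "remark-gfm";

type Section = { heading: string; level: number; body: string };

function sectionsFromMarkdown(md: string): Section[] {
  const tree = unified().use(remarkParse).use(remarkGfm).parse(md);
  // walkSections (not shown) visits heading nodes and accumulates the paragraphs under each one
  return walkSections(tree);
}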

function chunkSection(section: Section, maxTokens: number): string[] {
  if (tokenCount(section.heading + section.body) <= maxTokens) {
    return [`${section.heading}\n\n${section.body}`];
  }
  return splitOnParagraphs(section.body, maxTokens, section.heading);
}

Include the heading text in every chunk from that section. Users search with vocabulary from titles—"webhook retry policy," not "paragraph four." Heading prefixes improve embedding alignment without a separate title field, though you should still store heading_path in metadata for filters.
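The splitOnParagraphs helper referenced above is left undefined, so here is one possible shape for it: a minimal sketch that greedily packs paragraphs under the token budget and repeats the heading in every chunk. The approxTokens fallback (about four characters per token) is an assumption for illustration only; pass the tokenizer that matches your embedding model.

```typescript
// Hypothetical token estimate: roughly 4 characters per token.
// Swap in the tokenizer that matches your embedding model for real budgets.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

// Greedily pack paragraphs into chunks under maxTokens, repeating the
// heading at the top of every chunk so each vector carries the section title.
function splitOnParagraphs(
  body: string,
  maxTokens: number,
  heading: string,
  tokenCount: (s: string) => number = approxTokens
): string[] {
  const paragraphs = body
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];

  const flush = (): void => {
    if (current.length > 0) {
      chunks.push(`${heading}\n\n${current.join("\n\n")}`);
      current = [];
    }
  };

  for (const para of paragraphs) {
    const candidate = [heading, ...current, para].join("\n\n");
    // Close the current chunk when adding this paragraph would exceed the budget.
    if (current.length > 0 && tokenCount(candidate) > maxTokens) flush();
    // A single oversized paragraph is emitted alone rather than cut mid-sentence.
    current.push(para);
  }
  flush();
  return chunks;
}
```

Note that a paragraph larger than the budget still comes out as one oversized chunk here; a later stage would need to split it on sentence boundaries.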

Keep code blocks intact as first-class chunks

Splitting inside a fetch() example destroys retrieval. Developers query with symbol names, error strings, and import paths. Treat fenced code blocks as atomic units. If a block exceeds your token budget, split on blank lines between logical groups, never mid-line.

Pair each code chunk with a prose chunk that describes it when the surrounding paragraph is long. Store content_type: "code" in metadata so you can boost or filter at query time. A question mentioning "TypeScript example" can prefer code chunks; a conceptual question can prefer prose.
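To make the "atomic code block" rule concrete, here is a rough sketch that separates a Markdown string into prose and code segments before any token-based splitting runs. It is a line scanner, not a full parser: it assumes backtick fences only (no tilde fences, no nested longer fences), and the Segment type and splitProseAndCode name are illustrative. In a real worker the remark AST already gives you code nodes directly.

```typescript
type Segment = { content_type: "prose" | "code"; text: string };

// Three backticks, built programmatically so this example can sit inside a fence itself.
const FENCE = "`".repeat(3);

function splitProseAndCode(md: string): Segment[] {
  const segments: Segment[] = [];
  let buffer: string[] = [];
  let inFence = false;

  // Emit whatever is buffered as one segment of the given type.
  const flush = (type: Segment["content_type"]): void => {
    const text = buffer.join("\n").trim();
    if (text.length > 0) segments.push({ content_type: type, text });
    buffer = [];
  };

  for (const line of md.split("\n")) {
    if (line.trim().startsWith(FENCE)) {
      if (inFence) {
        buffer.push(line); // closing fence stays with the code segment
        flush("code");
      } else {
        flush("prose");    // close the prose collected so far
        buffer.push(line); // opening fence starts the code segment
      }
      inFence = !inFence;
    } else {
      buffer.push(line);
    }
  }
  // Unterminated fence: keep the remainder as code rather than dropping it.
  flush(inFence ? "code" : "prose");
  return segments;
}
```

Each returned segment already carries the content_type value you would copy into chunk metadata.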

Semantic chunking without boiling the ocean

Semantic chunking places boundaries where embedding similarity between adjacent sentences drops—idea boundaries inferred from vectors rather than punctuation. It can improve recall on narrative docs where headings are weak. It also adds ingest cost and moving parts.

A production-friendly compromise: use semantic boundaries only within sections already delimited by structure. That keeps cost bounded and avoids re-chunking an entire 40-page PDF in one GPU job.

async function semanticSplit(
  sentences: string[],
  embed: (t: string) => Promise<number[]>,
  threshold = 0.72
): Promise<string[][]> {
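  // Guard against empty input; sentences[0] would be undefined below.
  if (sentences.length === 0) return [];
  const groups: string[][] = [];
  let current = [sentences[0]];
  let prev = await embed(sentences[0]);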
  for (let i = 1; i < sentences.length; i++) {
    const vec = await embed(sentences[i]);
    const sim = cosineSimilarity(prev, vec);
    if (sim < threshold) {
      groups.push(current);
      current = [];
    }
    current.push(sentences[i]);
    prev = vec;
  }
  groups.push(current);
  return groups;
}

Batch embedding requests at ingest. Sequential per-sentence calls, as in the loop above, will time out your worker on large corpora. Cache sentence vectors when overlap-based re-chunking runs nightly.

Semantic chunking is not magic. Because a split fires whenever similarity falls below the threshold, high thresholds produce tiny fragments and low thresholds merge everything into oversized blobs. Tune the threshold on held-out questions, not on cosine similarity histograms alone.
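One way to follow the batching advice is to embed all sentences of a section in a single call and then run a pure grouping pass over the returned vectors. The sketch below covers only that grouping pass plus a plain cosine similarity helper; groupBySimilarity is a hypothetical name, and the batched embed call that produces the vectors is assumed to live in your worker.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Group sentences using vectors that were embedded in one batched call.
// A new group starts whenever adjacent-sentence similarity drops below threshold.
function groupBySimilarity(
  sentences: string[],
  vectors: number[][],
  threshold = 0.72
): string[][] {
  if (sentences.length !== vectors.length) {
    throw new Error("expected exactly one vector per sentence");
  }
  if (sentences.length === 0) return [];
  const groups: string[][] = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosineSimilarity(vectors[i - 1], vectors[i]) < threshold) {
      groups.push([]);
    }
    groups[groups.length - 1].push(sentences[i]);
  }
  return groups;
}
```

Because the grouping step never touches the network, you can unit test threshold behavior with hand-written vectors and cache the real ones between runs.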

Overlap, stride, and the boundary problem

Overlap exists because answers often sit on chunk borders. The sentence that defines your rate limit might be the last line of chunk 12 and the first line of chunk 13. Without overlap, only one fragment might land in top-k; with 10–15% token overlap, both fragments carry the critical sentence.

Overlap increases storage and embedding cost. Vector count scales with chunk_size / stride, so a 64-token overlap on 512-token chunks (stride 448) adds roughly 14 percent more vectors. That is usually cheaper than doubling k at query time to compensate for bad splits.

Use stride consciously: stride = chunk_size - overlap. Document these values in your ingest config. When someone asks why the Pinecone bill jumped, you want a one-line answer: "we increased overlap from 32 to 128 tokens on the compliance corpus."

  • Short FAQs: minimal or zero overlap; chunks are already atomic.
  • Long narrative docs: 50–100 token overlap at 400–600 token chunk size.
  • Code-heavy docs: overlap prose tails into code headers, not into function bodies.
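The stride arithmetic above is easy to check before you commit to a config. This is a small sketch, with hypothetical function names, that counts how many sliding windows a document of a given token length produces and how that compares with a zero-overlap split of the same chunk size.

```typescript
type OverlapCost = { stride: number; chunks: number; overheadVsNoOverlap: number };

// Number of windows a sliding-window token splitter emits for one document.
function countChunks(totalTokens: number, chunkSize: number, overlap: number): number {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  if (totalTokens <= 0) return 0;
  if (totalTokens <= chunkSize) return 1;
  const stride = chunkSize - overlap;
  // First window covers chunkSize tokens; each later window advances by stride.
  return 1 + Math.ceil((totalTokens - chunkSize) / stride);
}

// Compare a config against the same chunk size with zero overlap.
function overlapCost(totalTokens: number, chunkSize: number, overlap: number): OverlapCost {
  const chunks = countChunks(totalTokens, chunkSize, overlap);
  const baseline = countChunks(totalTokens, chunkSize, 0);
  return {
    stride: chunkSize - overlap,
    chunks,
    overheadVsNoOverlap: baseline === 0 ? 0 : chunks / baseline - 1,
  };
}
```

Run it against the token totals of your largest corpora when someone proposes a new overlap value; the overhead figure is the fraction of extra vectors you will embed and store.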

Metadata that makes chunks retrievable

Vectors alone are insufficient in multi-tenant SaaS backends. Metadata is how you filter before similarity search: tenant ID, product area, doc version, language, content type, source URL, last_modified.

interface ChunkRecord {
  id: string;
  text: string;
  embedding: number[];
  metadata: {
    tenant_id: string;
    source_id: string;
    source_url: string;
    heading_path: string[];
    chunk_index: number; // position within the source; used for adjacent-chunk lookups and stable IDs
    content_type: "prose" | "code" | "table";
    token_count: number;
    doc_version: string;
    indexed_at: string;
  };
}

Store heading paths as arrays, not flattened strings, so you can filter on heading_path[0] === "API Reference" in databases that support JSON metadata predicates. A PostgreSQL JSONB column next to your pgvector embedding handles this well. Pinecone metadata filters match on list membership rather than position, so if you need the top-level heading there, store it as its own field too.

Avoid stuffing entire parent documents into metadata. Store pointers. The LLM context window gets the chunk text; if you need parent context at generation time, fetch adjacent chunks by source_id and chunk_index after retrieval—a pattern called small-to-big or parent-document retrieval.
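A minimal sketch of that small-to-big step follows. It assumes each stored chunk carries source_id and chunk_index, and it uses an in-memory array as a stand-in for the primary database query; StoredChunk and expandWithNeighbors are illustrative names, not part of any library.

```typescript
type StoredChunk = { source_id: string; chunk_index: number; text: string };

// Given a retrieved chunk, return it together with its neighbors from the
// same source, in document order. `all` stands in for a database lookup
// keyed on source_id and a chunk_index range.
function expandWithNeighbors(
  hit: StoredChunk,
  all: StoredChunk[],
  window = 1
): string {
  return all
    .filter(
      (c) =>
        c.source_id === hit.source_id &&
        Math.abs(c.chunk_index - hit.chunk_index) <= window
    )
    .sort((a, b) => a.chunk_index - b.chunk_index)
    .map((c) => c.text)
    .join("\n\n");
}
```

The vector search still ranks small, precise chunks; only the text handed to the LLM grows, so retrieval precision is unaffected by the window size.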

Node.js ingestion patterns that scale

Ingest belongs in a worker, not the request path of your Express or Fastify API. A typical flow: object storage event triggers a job, the worker fetches the raw doc, normalizes format, chunks, embeds in batches, upserts to the vector store, writes a manifest row in PostgreSQL for idempotency.

A reusable chunk pipeline in TypeScript

Compose chunking as a pipeline of pure functions. Each stage is testable without calling OpenAI or Pinecone.

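type Chunk = { text: string; meta: Record<string, unknown> };

// Each stage sees the source document plus the chunks produced so far.
// NormalizedDoc and the stage functions are defined elsewhere in the worker.
type Stage = (doc: NormalizedDoc, chunks: Chunk[]) => Chunk[];

const pipeline: Stage[] = [
  stripBoilerplate,
  splitByStructure,
  enforceMaxTokens(600), // factory that returns a Stage
  attachHeadingPaths,
  dedupeNearIdentical,
];

export function chunkDocument(doc: NormalizedDoc): Chunk[] {
  return pipeline.reduce<Chunk[]>((chunks, step) => step(doc, chunks), []);
}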

Keep raw source in S3 or blob storage; store chunk manifests with content hashes. Re-chunk only when the hash changes. Teams that re-embed everything nightly pay for instability, not freshness.
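The hash check can be as small as the sketch below. It uses Node's built-in crypto module for SHA-256; the Map is a stand-in for the PostgreSQL manifest table, and needsRechunk and recordIngest are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// source_id -> content hash recorded at the last successful ingest.
// In production this would be a manifest table, not an in-memory Map.
type Manifest = Map<string, string>;

function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// True when the source is new or its content changed since the last ingest.
function needsRechunk(manifest: Manifest, sourceId: string, rawText: string): boolean {
  return manifest.get(sourceId) !== contentHash(rawText);
}

// Call only after chunking, embedding, and upserting have all succeeded,
// so a failed job is retried on the next trigger.
function recordIngest(manifest: Manifest, sourceId: string, rawText: string): void {
  manifest.set(sourceId, contentHash(rawText));
}
```

Hash the normalized text rather than the raw bytes if cosmetic changes such as trailing whitespace should not trigger a re-embed.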

Batching embeddings and vector upserts

Embedding APIs accept batches—often dozens to hundreds of inputs per call depending on provider limits. Aggregate chunks to the batch size, handle rate limits with exponential backoff, and dead-letter failures with the chunk IDs that failed so retries are surgical.

async function embedAndUpsert(chunks: Chunk[], deps: Deps) {
  const BATCH = 64;
  for (let i = 0; i < chunks.length; i += BATCH) {
    const batch = chunks.slice(i, i + BATCH);
    const vectors = await deps.embed(batch.map((c) => c.text));
    const records = batch.map((c, j) => ({
      id: c.meta.chunk_id as string,
      values: vectors[j],
      metadata: { ...c.meta, text: truncate(c.text, 8000) },
    }));
    await deps.vectorStore.upsert(records);
  }
}

Some stores cap metadata payload size. Truncate stored text for the vector DB but keep full text in PostgreSQL keyed by chunk_id if your generation step needs the complete chunk. Retrieval returns IDs; hydration hits your primary database. That split is common in production RAG backends and keeps vector metadata lean.
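The batching loop above has no retry handling, so here is one possible wrapper for the embed and upsert calls. It is a sketch under assumptions: it retries on every error, whereas a real worker would inspect the provider's error and retry only on rate limits or transient failures, and the withBackoff name, attempt count, and base delay are illustrative. The sleep function is injectable so tests do not have to wait.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Retry fn with exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ...
// Rethrows the last error once maxAttempts is exhausted so the caller can
// dead-letter the batch together with its chunk IDs.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 500,
  sleep: Sleep = defaultSleep
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(baseMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Inside embedAndUpsert, the call would look like withBackoff(() => deps.embed(batch.map((c) => c.text))), with a catch around it that records the failed chunk IDs.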

Picking chunk size for your vector database

Vector databases do not dictate chunk size, but they influence ops. Pinecone, Weaviate, Milvus, and pgvector all handle high-dimensional vectors similarly for our purposes; the constraint is embedding model quality at a given length and your query latency budget.

| Content type | Suggested token range | Notes |
| --- | --- | --- |
| API reference | 200–400 | One endpoint or schema block per chunk |
| Internal wiki | 400–600 | Heading-bounded sections |
| Support tickets (batch) | 300–500 | Strip PII before embed |
| Legal / compliance | 250–350 | Smaller chunks, higher overlap |
| Code repositories | 150–300 per file segment | Split on functions or classes |

When you increase chunk size, monitor answer faithfulness, not just retrieval hit rate. Bigger chunks retrieve more often but inject noise into the LLM context, which shows up as vague or contradictory answers.

Dimension and distance metric choices are orthogonal to chunking, but re-chunking usually requires re-embedding. Version your chunk config (chunk_schema_version: 3) in metadata so you can run blue/green indexes during migrations.

Evaluating chunk quality before you ship

Build an eval set of forty to sixty question–answer–source triples from real tickets, sales calls, or Slack questions. For each question, record which document and section should answer it. Run retrieval-only evals: did the correct chunk appear in top-3?

  • Hit@k: correct chunk in top k results.
  • MRR: how high the first correct chunk ranks.
  • Fragmentation: answer split across multiple chunks with none ranking high—signals overlap or size issues.
  • Contamination: top chunk contains the answer plus unrelated policy—signals chunks too large.
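The first two metrics above are a few lines each. This sketch assumes every eval case has exactly one expected chunk ID and a ranked list of retrieved IDs; the EvalCase shape is an illustration, so adapt it if several chunks can legitimately answer a question.

```typescript
type EvalCase = {
  question: string;
  expectedChunkId: string;
  retrievedIds: string[]; // ranked, best match first
};

// Fraction of cases whose expected chunk appears in the top k results.
function hitAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(
    (c) => c.retrievedIds.slice(0, k).indexOf(c.expectedChunkId) !== -1
  ).length;
  return hits / cases.length;
}

// Mean reciprocal rank: 1 / rank of the expected chunk, 0 when it is missing.
function meanReciprocalRank(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  let total = 0;
  for (const c of cases) {
    const index = c.retrievedIds.indexOf(c.expectedChunkId);
    if (index >= 0) total += 1 / (index + 1);
  }
  return total / cases.length;
}
```

Store the output alongside the chunk config that produced it so every later strategy has a number to beat.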

Log failed queries in production with the retrieved chunk IDs (not user PII). Review weekly. Chunk tuning is a product loop, not a one-time ingest script.

A/B two chunk configs on shadow traffic before flipping the production index. Node workers can dual-write to index_v4 and index_v5 while an offline job compares hit rates across the two.

Production mistakes teams repeat

Chunking PDFs as plain text. Column layouts and footers interleave sentences. Use a PDF parser that outputs reading order, or convert to Markdown at source.

Ignoring deduplication. Navbar text repeated on every page creates near-duplicate vectors that crowd out useful hits. Strip templates and chrome.

Embedding summaries instead of source. LLM summaries compress detail and drift on numbers. Embed primary text; use summaries for display if needed.

One global chunk config. Your API docs and Slack export need different strategies. Route by source_type in the ingest worker.

No idempotent upserts. Stable chunk IDs derived from hash(source_id + heading_path + chunk_index) prevent duplicate vectors when jobs retry.

Skipping whitespace normalization. Odd Unicode spaces and HTML entities fragment matches. Normalize once in NormalizedDoc.
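For the idempotent-upsert item above, a stable ID can be derived with Node's built-in crypto module. This is a sketch rather than a prescribed scheme: stableChunkId is a hypothetical name, and truncating the digest to 32 hex characters is an arbitrary choice to keep IDs short.

```typescript
import { createHash } from "node:crypto";

// Derive a deterministic chunk ID from where the chunk lives in its source.
// JSON.stringify gives an unambiguous encoding, so a heading path of
// ["API", "Auth"] cannot collide with ["API Auth"].
function stableChunkId(
  sourceId: string,
  headingPath: string[],
  chunkIndex: number
): string {
  const key = JSON.stringify([sourceId, headingPath, chunkIndex]);
  return createHash("sha256").update(key, "utf8").digest("hex").slice(0, 32);
}
```

Because the ID depends only on position, a retried job overwrites the same records instead of appending duplicates. The trade-off is that inserting a section shifts later IDs, so delete a source's old chunks before upserting its new ones.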

Summary

RAG chunking strategies that work in production start with a measurable baseline—fixed token splits with overlap—then add structure-aware boundaries for Markdown, HTML, and code. Semantic splits earn their keep inside large sections, not across entire corpora. Metadata and stable chunk IDs turn vector search into something you can filter, reindex, and debug from a Node.js worker without heroic manual cleanup.

Pick chunk sizes from content type, not from embedding model maximums. Evaluate with real questions, log retrieval failures, and version your chunk schema when you change stride or overlap. The retriever you ship today is only as good as the boundaries you drew at ingest—and those boundaries are absolutely something your backend team can own.

FAQ

What chunk size should I start with for a Node.js RAG API?

Start at 512 tokens with a 64-token overlap for general technical documentation. Implement heading-bounded splits first, then tune down to 384 if answers feel noisy or up to 600 if retrieved chunks lack context. Measure hit@3 on fifty real queries before changing defaults.

Should I chunk before or after cleaning HTML?

Clean first, chunk second. Remove nav, scripts, and duplicate chrome, convert to a normalized Markdown or plain text representation, then apply structure-aware chunking. Chunking raw HTML without parsing tends to embed tag soup and wreck similarity scores.
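As part of the cleaning pass, text normalization might look like the sketch below. It covers Unicode and whitespace only; HTML entity decoding is assumed to have happened already in the HTML parser, and the exact rules (for instance collapsing three or more newlines into a paragraph break) are choices you may want to adjust.

```typescript
// Normalize extracted text once, before chunking and hashing.
function normalizeText(raw: string): string {
  return raw
    .normalize("NFKC")                     // folds no-break and wide spaces, ligatures, full-width forms
    .replace(/[\u200B-\u200D\uFEFF]/g, "") // strips zero-width characters and stray BOMs
    .replace(/\r\n?/g, "\n")               // unifies line endings
    .replace(/[ \t\f\v]+/g, " ")           // collapses runs of horizontal whitespace
    .replace(/ *\n */g, "\n")              // trims spaces around line breaks
    .replace(/\n{3,}/g, "\n\n")            // keeps at most one blank line between paragraphs
    .trim();
}
```

Be aware that NFKC also rewrites characters such as superscripts and ligatures, which is usually what you want for search text but not for content you display verbatim.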

Is semantic chunking worth the extra latency at ingest time?

It is worth testing on narrative content where headings are sparse—case studies, postmortems, exported Notion pages. For API references and runbooks, structural rules usually beat semantic splits with lower cost. Run both on a sample corpus and compare hit@3 before committing.

How do I handle tables and lists that span multiple paragraphs?

Keep tables as single chunks when under your token budget; otherwise split on row groups with repeated header rows in each chunk. For numbered procedures, never split between a step number and its action sentence. Store content_type: "table" or "list" in metadata for targeted filtering.
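Splitting on row groups with a repeated header can be sketched as follows. It assumes a pipe-delimited Markdown table whose first two lines are the header row and the separator row, and it groups by a fixed row count for simplicity; splitTable is a hypothetical helper, and a production version would group by token budget instead.

```typescript
// Split a Markdown table into chunks of at most rowsPerChunk body rows,
// repeating the header and separator lines at the top of every chunk.
function splitTable(tableLines: string[], rowsPerChunk: number): string[] {
  if (rowsPerChunk < 1) throw new Error("rowsPerChunk must be at least 1");
  const lines = tableLines.filter((l) => l.trim().length > 0);
  if (lines.length === 0) return [];
  // Header only, or header plus separator: nothing to split.
  if (lines.length <= 2) return [lines.join("\n")];

  const header = lines.slice(0, 2);
  const rows = lines.slice(2);
  const chunks: string[] = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    chunks.push([...header, ...rows.slice(i, i + rowsPerChunk)].join("\n"));
  }
  return chunks;
}
```

Every chunk stays readable on its own because the column names travel with the rows, which also helps the embedding capture what the numbers mean.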

Does overlap hurt storage cost in Pinecone or pgvector?

Yes. Vector count scales with chunk_size / (chunk_size - overlap), so ten percent overlap adds roughly eleven percent more vectors. That is usually acceptable compared to the retrieval gains on borderline answers. Track vector count per tenant so billing surprises are explainable.

How often should I re-chunk when source documents change?

Re-chunk on content hash change, not on a blind schedule. Webhooks from your CMS or Git repository should enqueue ingest jobs for affected sources only. Full reindex quarterly is a safety net, not the primary freshness mechanism.