Two-Tier RAG at Scale: Architecture Behind an AI Beverage Concierge

Most RAG tutorials show you how to embed a PDF and query it. Real-world RAG is harder: multiple data sources with different schemas, storage constraints, retrieval quality tuning, and the question of how to ground an AI persona in domain knowledge without hallucinating wine pairings.

Vinny is a consumer-facing AI wine concierge that combines 71,540 curated wine reviews with 181 Wikipedia articles about grape varieties, regions, and winemaking techniques, all searchable through a single pgvector HNSW index on Supabase's free tier.

Why Two Tiers?

Wine questions fall into two distinct categories. "What's a good bold red under $30?" requires searching actual bottle data with tasting notes, scores, and prices. "What's the difference between Chablis and other Chardonnay?" requires domain knowledge about terroir, winemaking techniques, and regional classification.

A single-source RAG system forces you to choose: either your AI knows specific bottles but can't explain why a Burgundy Pinot Noir tastes different from an Oregon one, or it knows wine theory but can't recommend an actual bottle.

The two-tier approach separates these concerns while unifying retrieval:

flowchart TD
  accTitle: Two-tier RAG retrieval for the beverage concierge
  accDescr: A user question is checked for bottle-shopping intent. Shopping intent routes to a Tavily web search; otherwise the query is embedded and sent to a single Supabase RPC that searches Tier 1 wine catalog and Tier 2 knowledge articles over pgvector HNSW indexes. Candidates are reranked by Cohere and attributed to their source, then added to a system prompt carrying the sommelier persona, user preferences, and tasting journal before GPT-4.1-mini produces a grounded recommendation.
  q["User question"] --> intent{"Bottle-shopping intent?"}
  intent -->|"yes"| web["Tavily web search"]
  intent -->|"no"| embed["Embed query"]
  embed --> rpc["Unified Supabase RPC"]
  subgraph retrieval["pgvector HNSW retrieval"]
    rpc --> t1[("Tier 1: wine catalog")]
    rpc --> t2[("Tier 2: 181 knowledge articles")]
  end
  t1 --> rerank["Cohere rerank"]
  t2 --> rerank
  web --> ctx
  rerank -->|"results + source attribution"| ctx["System prompt: persona + preferences + journal"]
  ctx --> llm["GPT-4.1-mini"]
  llm --> ans["Grounded recommendation"]

Tier 1, Wine Reviews (71,540 bottles): Sourced from a Kaggle dataset of ~130K reviews, filtered to points ≥ 88 as a quality gate. Each review includes the title, tasting description, variety, winery, country, province, score, and price. The embedding text combines title + description + variety, with title weighted first as the highest-signal identifier. (This original Kaggle corpus was later re-sourced to an openly licensed dataset; see Catalog Re-Source and Embedding Modernization below.)

Tier 2, Domain Knowledge (181 articles, 1,103 chunks): Curated Wikipedia articles across 10 categories, grape varieties, wine regions, appellations, classification systems, winemaking techniques, wine types, service, education, science, and history. Articles are chunked by paragraph boundaries at roughly 500 words per chunk.

Embedding Strategy: 512 Dimensions on Free Tier

Storage was the binding constraint: a 500MB ceiling. At 1536 dimensions (the default for text-embedding-3-small), each vector consumes ~6.2KB. Multiply by 72,000+ records and you've blown the budget before adding indexes.

OpenAI's text-embedding-3-small natively supports a dimensions parameter. Setting it to 512 reduces per-vector storage to ~2KB, a 3x reduction with negligible semantic loss for cosine similarity operations. The entire wine corpus fits in ~146MB, leaving room for the knowledge base, HNSW indexes, user profiles, and tasting journal entries.

The one-time embedding cost was small: roughly $2 to embed 71K wines and $0.03 for the Wikipedia batch. (Embeddings were later moved back to the full native 1536 dimensions; see Catalog Re-Source and Embedding Modernization below.)

Unified Search: One RPC, Two Sources

Both tiers share the same vector(512) column type and HNSW index. One Supabase RPC queries both tables and returns results tagged by source:

CREATE OR REPLACE FUNCTION match_wine_documents(
  query_embedding vector(512),
  match_threshold float default 0.5,
  match_count int default 10,
  source_filter text default null
)
RETURNS TABLE (
  id bigint, source text, title text,
  content text, metadata jsonb, similarity float
)

The source field tells the system prompt whether a result is a specific bottle recommendation (wines) or educational context (knowledge). Vinny's persona uses this to format citations differently, inline numbered references for bottles, contextual explanations for knowledge.

The 0.5 similarity threshold was tuned through test queries. Representative wine questions scored 0.60-0.64 similarity against relevant results, so 0.5 provides comfortable headroom without flooding results with noise.

Conversation Design: The Sommelier Persona

Vinny's system prompt establishes a persona: "a friendly sommelier with the charm of a dinner party host and the expertise of a Master of Wine." The technical constraints around that persona did more work than the voice itself:

No markdown headers in responses. Wine conversation should flow like prose, not documentation. The prompt explicitly forbids structured output formatting in chat responses.

Vintage handling via prompt, not data. Wine reviews reference specific vintages (2018 Caymus Cabernet), but recommending a specific vintage that's no longer available is worse than useless. The prompt instructs Vinny to present wines as producer/style recommendations with the caveat that catalog pricing may not reflect current retail.

Tool routing by intent. The chat endpoint detects explicit bottle-shopping intent ("where can I buy Caymus?") and routes to a web search tool before the general vector search. Pronoun follow-ups ("where can I buy this?") after Vinny recommends a wine are handled by injecting the current bottle context into the search query.

Injection hardening. All user-writable fields (display name, tasting notes, grape preferences) are sanitized for control characters and capped in length before injection into the system prompt. The persona is role-locked with explicit instruction-ignore defenses.

Personalization: Preferences + Tasting Journal

Authenticated users complete a 6-step onboarding quiz that captures preferred styles, sweetness/body preferences, favorite grapes, budget range, experience level, and typical occasions. These preferences are appended to the system prompt for every chat request.

The tasting journal adds a temporal dimension. Users log wines they've tried with ratings and notes. The 20 most recent entries are injected into context, so Vinny can reference past preferences: "You gave that Barolo an 85 last month, the Nebbiolo I'm suggesting has a similar tannic structure but more approachable early drinking."

All personalization data is protected by Supabase Row Level Security. The policy is simple: auth.uid() = id for profiles, auth.uid() = user_id for journal entries. No custom session middleware needed.

MCP: Vinny as a Service

Beyond the chat UI, Vinny exposes its wine tools via the Model Context Protocol. External AI agents (like Claude Desktop) can query the wine catalog through a JSON-RPC 2.0 endpoint:

The MCP endpoint authenticates via bearer token with constant-time comparison (crypto.timingSafeEqual), rate limits at 100 calls/day via Upstash Redis, and forces Node.js runtime for crypto API compatibility. Each request spins up a fresh McpServer instance, stateless by design for serverless deployment on Vercel.

That makes the wine catalog a reusable API any agent framework can call, not just the chat UI's backend.

Rate Limiting: Three Tiers

Every chat message hits the OpenAI API, so uncontrolled usage turns directly into spend. Rate limiting bounds that exposure.

The rate limiter uses Upstash Redis with a sliding window algorithm:

Anonymous users: 10 messages/day by IP address
Authenticated users: 50 messages/day by user ID
MCP clients: 100 calls/day by static key

Separate Redis prefixes prevent quota collisions between tiers. The middleware intercepts requests before they reach the chat endpoint, so rejected requests never touch OpenAI.

Data Pipeline: Idempotent Ingestion

The Kaggle dataset required careful handling. The ingestion script processes CSV records in 100-wine batches, generates embeddings via embedMany(), and upserts with a composite unique constraint on (title, taster_name). This makes re-runs fully idempotent, you can restart a failed ingestion without duplicating records.

Wikipedia ingestion follows the same pattern: fetch articles via the MediaWiki API, chunk by paragraph boundaries, batch embed, and upsert with a (title, chunk_index) unique constraint.

A subtle implementation detail: standard fetch with connection keep-alive caused issues with Supabase's connection pooler during bulk upserts. A custom noKeepAliveFetch wrapper resolved the undici connection pooling bugs.

Designing Around Constraints

Three constraints shaped the architecture from day one, and each one drove a specific design choice:

512-dim embeddings kept the original corpus plus HNSW indexes inside the storage ceiling (later moved to halfvec(1536); see Catalog Re-Source and Embedding Modernization).
Sliding-window rate limiting bounds API spend before a request ever reaches the model.
Edge-compatible middleware keeps compute light enough to run on serverless.

What I'd Do Differently

Chunk size tuning matters more than embedding dimensions. I optimized vector dimensions hard and chunk boundaries barely at all. The Wikipedia paragraph-based chunking works, but variable-length chunks create inconsistent retrieval quality. Fixed-size chunks with overlap would likely improve results.

Vercel AI SDK's tool calling is excellent. The tool() abstraction with Zod schemas for parameters made adding new capabilities (web search, journal write, bottle verification) trivial. I'd adopt it earlier in future projects.

RLS is underrated for AI applications. Row Level Security eliminates an entire category of authorization bugs. The policy is one line of SQL, and it works across every query, chat context injection, journal CRUD, profile updates, without any application-level checks.

Phase 17: Multi-Category Schema (May 2026)

Vinny shipped wine-first. Pilot venues then asked the obvious question: what about beer, spirits, and cocktails? Phase 17 expanded the data model to cover all four beverage categories. The core design question: do this with one polymorphic beverages table, or separate tables per category?

I picked separate tables (ADR-014). Three reasons:

Column divergence is fundamental, not incidental. Wine has 15+ wine-specific columns (points, variety, winery, body, acidity, harmonize, vintage). Beer needs ibu, srm, style, substyle, is_craft, serving_temp. Spirits need proof, age_statement, cask_type, botanicals. Cocktails need ingredients (JSONB), technique, glassware, family, ice_type. A polymorphic table ends up with 50+ mostly-NULL columns. Index efficiency degrades when most rows don't use most indexed columns.

Vector space coherence. Separate HNSW indexes per table produce better recall. A query for "hoppy IPA" won't pull wine vectors into the candidate set. A unified index would mix unrelated embedding spaces, degrading nearest-neighbor accuracy.

RPC type safety. The existing hybrid_search_wines RPC uses typed filter parameters (filter_variety, filter_min_body). The same pattern extends cleanly: hybrid_search_beers(filter_style, filter_min_ibu). A polymorphic approach would need either one RPC with 30+ nullable parameters or runtime dispatch logic inside the RPC.

The downsides are real and named. More tables and RPCs to maintain (3 new tables, 3 new hybrid search RPCs, 3 new index sets). Cross-category queries require fan-out, handled by a unified search_beverages tool that dispatches to the appropriate RPC based on a category discriminator ('beer' | 'spirit' | 'cocktail'). For cross-category food pairings (e.g., "what pairs with steak?"), search_beverage_pairings queries a food_pairings table unified by a beverage_domain column.

Tenant-scoped enabledCategories config gates which categories each tenant exposes. A beer-only brewery sees search_beverages with only beer filters; a cocktail bar only cocktail filters. This also supports future product forks: a standalone "Beer Expert" deployment is the same codebase with enabledCategories: ['beer'].

Data sources being integrated (Phase 17, in progress):

WineVybe API, 50K+ beers and spirits with tasting profiles. The designated primary source for beer and spirits catalog data.
Catalog.beer (CC BY 4.0), structured beer data and brewery metadata, rate-limited so it serves enrichment rather than bulk catalog.
Open Brewery DB, 9,527 breweries for metadata enrichment.
TheCocktailDB, 636 cocktails with multi-ingredient filtering, glassware, technique.
Barnivore.com, vegan verification across all alcohol categories.

Migration is purely additive. The wines table is never modified. New tables, new RPCs, new indexes, zero regression risk to existing wine functionality.

Catalog Re-Source and Embedding Modernization (May 2026)

Two changes landed after the original build, both about putting the project on durable footing.

The wine catalog moved to an open license. The first catalog was seeded from a Kaggle WineEnthusiast dataset, the 71,540 wines described above. That data is licensed CC BY-NC-SA, which forbids commercial use. To keep the project clean for any future commercial path, I re-sourced the entire wine catalog to the CC0-licensed X-Wines dataset (~100K wines) and purged every non-CC0 row. Grapeminds stays in the stack as a live wine API for tasting notes, drinking windows, and region detail; it was always an enrichment layer, never the bulk catalog.

Embeddings moved to full fidelity. The original 512-dimension vectors were a storage optimization for the free tier. Postgres halfvec changed the trade-off: it stores each vector component in 2 bytes instead of 4, halving the ~6.2KB that a full-precision 1536-dimension vector would cost to roughly 3.1KB. That made text-embedding-3-small's native 1536 dimensions affordable, so the catalog was re-embedded at 1536 dimensions and stored as halfvec(1536). Retrieval now runs at full native fidelity while staying inside the storage budget.

Tech Stack

Framework: Next.js 16, React 19, TypeScript
Database: Supabase (Postgres 15 + pgvector HNSW), one table per beverage category
AI: GPT-4.1-mini (chat), GPT-4.1 (vision), text-embedding-3-small (1536-dim, stored as halfvec(1536))
AI SDK: Vercel AI v6 (streamText, tool(), structured output)
Reranking: Cohere rerank-v3.5
Auth: Supabase Auth (Email + Google OAuth)
Rate Limiting: Upstash Redis (sliding window)
Web Search: Tavily (domain-constrained, verified URLs)
Data sources: X-Wines (CC0 wine catalog), Grapeminds (live wine API); beer, spirits, and cocktails (WineVybe, TheCocktailDB, Open Brewery DB) in progress
MCP: @modelcontextprotocol/sdk (JSON-RPC 2.0 server)
Hosting: Vercel (Hobby tier)