Most RAG tutorials show you how to embed a PDF and query it. Real-world RAG is harder: multiple data sources with different schemas, storage constraints, retrieval quality tuning, and the question of how to ground an AI persona in domain knowledge without hallucinating wine pairings.
Vinny is a consumer-facing AI wine concierge that combines 71,540 curated wine reviews with 181 Wikipedia articles about grape varieties, regions, and winemaking techniques — all searchable through a single pgvector HNSW index on Supabase's free tier.
Why Two Tiers?
Wine questions fall into two distinct categories. "What's a good bold red under $30?" requires searching actual bottle data with tasting notes, scores, and prices. "What's the difference between Chablis and other Chardonnay?" requires domain knowledge about terroir, winemaking techniques, and regional classification.
A single-source RAG system forces you to choose: either your AI knows specific bottles but can't explain why a Burgundy Pinot Noir tastes different from an Oregon one, or it knows wine theory but can't recommend an actual bottle.
The two-tier approach separates these concerns while unifying retrieval:
Tier 1 — Wine Reviews (71,540 bottles): Sourced from a Kaggle dataset of ~130K reviews, filtered to points ≥ 88 as a quality gate. Each review includes the title, tasting description, variety, winery, country, province, score, and price. The embedding text combines title + description + variety, with title weighted first as the highest-signal identifier.
Tier 2 — Domain Knowledge (181 articles, 1,103 chunks): Curated Wikipedia articles across 10 categories — grape varieties, wine regions, appellations, classification systems, winemaking techniques, wine types, service, education, science, and history. Articles are chunked by paragraph boundaries at roughly 500 words per chunk.
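The paragraph-boundary chunking described above can be sketched as follows. This is an illustrative version, not the project's actual ingestion code: paragraphs are accumulated until the running word count would exceed the ~500-word target, then a chunk is emitted so that no paragraph is ever split mid-thought.

```typescript
// Illustrative paragraph-boundary chunker (the real ingestion script may
// differ). Accumulates whole paragraphs until the next one would push the
// chunk past the word target, then starts a new chunk.
function chunkByParagraphs(text: string, targetWords = 500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current: string[] = [];
  let words = 0;

  for (const para of paragraphs) {
    const paraWords = para.split(/\s+/).length;
    // Flush the current chunk if adding this paragraph would exceed the target.
    if (words > 0 && words + paraWords > targetWords) {
      chunks.push(current.join("\n\n"));
      current = [];
      words = 0;
    }
    current.push(para);
    words += paraWords;
  }
  if (current.length) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Because chunks end on paragraph boundaries, their lengths vary — which is exactly the retrieval-consistency tradeoff revisited in the "What I'd Do Differently" section.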
Embedding Strategy: 512 Dimensions on Free Tier
Storage was the binding constraint. Supabase's free tier provides 500MB. At 1536 dimensions (the default for text-embedding-3-small), each vector consumes ~6.2KB. Multiply by 72,000+ records and you've blown the budget before adding indexes.
OpenAI's text-embedding-3-small natively supports a dimensions parameter. Setting it to 512 reduces per-vector storage to ~2KB — a 3x reduction with negligible semantic loss for cosine similarity operations. The entire wine corpus fits in ~146MB, leaving room for the knowledge base, HNSW indexes, user profiles, and tasting journal entries.
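The back-of-envelope math behind those numbers: pgvector stores each dimension as a 4-byte float plus a small per-value header, so the raw vector payload can be estimated directly. The results land close to the figures above; small differences come from Postgres row overhead and TOAST.

```typescript
// Rough storage estimate for pgvector columns: 4 bytes per dimension
// plus an ~8-byte header per vector value (row overhead not included).
const BYTES_PER_DIM = 4;

function vectorBytes(dims: number): number {
  return dims * BYTES_PER_DIM + 8;
}

const wines = 71_540;
const toMB = (b: number) => b / (1024 * 1024);

const at1536 = toMB(vectorBytes(1536) * wines); // ~420 MB of raw vectors
const at512 = toMB(vectorBytes(512) * wines);   // ~140 MB of raw vectors
```

At 1536 dimensions the vectors alone nearly fill the 500MB tier before any indexes; at 512 they fit with headroom to spare.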
The embedding cost itself is trivial: roughly $2 to embed 71K wines and $0.03 for the Wikipedia batch at OpenAI's current pricing.
Unified Search: One RPC, Two Sources
Both tiers share the same vector(512) column type and HNSW index structure. A single Supabase RPC function queries both tables and returns results with source attribution:
```sql
CREATE OR REPLACE FUNCTION match_wine_documents(
  query_embedding vector(512),
  match_threshold float default 0.5,
  match_count int default 10,
  source_filter text default null
)
RETURNS TABLE (
  id bigint,
  source text,
  title text,
  content text,
  metadata jsonb,
  similarity float
)
```
The source field tells the system prompt whether a result is a specific bottle recommendation (wines) or educational context (knowledge). Vinny's persona uses this to format citations differently — inline numbered references for bottles, contextual explanations for knowledge.
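On the caller side, this might look roughly like the sketch below. The `partitionBySource` helper is a hypothetical name introduced here for illustration; the `supabase.rpc()` call shape is standard supabase-js.

```typescript
// Hypothetical caller-side sketch: run the unified RPC, then split results
// by their source tag so the prompt builder can format bottle hits and
// knowledge hits differently.
type MatchRow = {
  id: number;
  source: "wines" | "knowledge";
  title: string;
  content: string;
  similarity: number;
};

function partitionBySource(rows: MatchRow[]) {
  return {
    bottles: rows.filter((r) => r.source === "wines"),
    knowledge: rows.filter((r) => r.source === "knowledge"),
  };
}

// With supabase-js this would look roughly like:
// const { data } = await supabase.rpc("match_wine_documents", {
//   query_embedding: embedding, // 512-dim array from text-embedding-3-small
//   match_threshold: 0.5,
//   match_count: 10,
// });
// const { bottles, knowledge } = partitionBySource(data ?? []);
```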
The 0.5 similarity threshold was tuned through test queries. Representative wine questions scored 0.60-0.64 similarity against relevant results, so 0.5 provides comfortable headroom without flooding results with noise.
Conversation Design: The Sommelier Persona
Vinny's system prompt establishes a persona that's "a friendly sommelier with the charm of a dinner party host and the expertise of a Master of Wine." But the technical constraints matter more than the personality:
No markdown headers in responses. Wine conversation should flow like prose, not documentation. The prompt explicitly forbids structured output formatting in chat responses.
Vintage handling via prompt, not data. Wine reviews reference specific vintages (2018 Caymus Cabernet), but recommending a specific vintage that's no longer available is worse than useless. The prompt instructs Vinny to present wines as producer/style recommendations with the caveat that catalog pricing may not reflect current retail.
Tool routing by intent. The chat endpoint detects explicit bottle-shopping intent ("where can I buy Caymus?") and routes to a web search tool before the general vector search. Pronoun follow-ups ("where can I buy this?") after Vinny recommends a wine are handled by injecting the current bottle context into the search query.
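An illustrative version of that routing logic (the post doesn't show the endpoint's actual heuristics, so the patterns and function names here are assumptions):

```typescript
// Illustrative intent router. Explicit shopping phrases route to web search;
// bare pronoun follow-ups get the last recommended bottle injected into the
// query before searching.
const SHOPPING_INTENT = /\b(where (can|do) i (buy|find|get)|price of|purchase)\b/i;
const PRONOUN_FOLLOWUP = /\b(this|that|it)\b/i;

function routeQuery(
  message: string,
  lastBottle: string | null
): { tool: "web_search" | "vector_search"; query: string } {
  if (SHOPPING_INTENT.test(message)) {
    const query =
      PRONOUN_FOLLOWUP.test(message) && lastBottle
        ? `${message} ${lastBottle}` // resolve "this" to the current bottle
        : message;
    return { tool: "web_search", query };
  }
  return { tool: "vector_search", query: message };
}
```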
Injection hardening. All user-writable fields (display name, tasting notes, grape preferences) are sanitized for control characters and capped in length before injection into the system prompt. The persona is role-locked with explicit instruction-ignore defenses.
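A minimal sketch of that sanitization step, assuming a simple strip-and-cap policy (the exact character set and length limits in the project may differ):

```typescript
// Sketch of the prompt-injection sanitizer: strip control characters,
// collapse whitespace, and cap length before a user-written field is
// interpolated into the system prompt. The 200-char cap is illustrative.
function sanitizeForPrompt(input: string, maxLen = 200): string {
  return input
    .replace(/[\u0000-\u001f\u007f]/g, " ") // control chars (incl. newlines) -> spaces
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, maxLen);
}
```

Stripping newlines matters here: many prompt-injection payloads rely on line breaks to fake a new "system" turn inside the interpolated text.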
Personalization: Preferences + Tasting Journal
Authenticated users complete a 6-step onboarding quiz that captures preferred styles, sweetness/body preferences, favorite grapes, budget range, experience level, and typical occasions. These preferences are appended to the system prompt for every chat request.
The tasting journal adds a temporal dimension. Users log wines they've tried with ratings and notes. The 20 most recent entries are injected into context, so Vinny can reference past preferences: "You gave that Barolo an 85 last month — the Nebbiolo I'm suggesting has a similar tannic structure but more approachable early drinking."
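A sketch of how those journal entries might be rendered into prompt context. The field names below are assumptions, not the actual schema:

```typescript
// Illustrative context builder for the tasting journal: render the 20 most
// recent entries as compact one-line summaries for the system prompt.
type JournalEntry = {
  wine: string;
  rating: number;     // assumed 100-point scale, matching the review data
  notes: string;
  tasted_at: string;  // ISO date, e.g. "2024-01-15"
};

function journalContext(entries: JournalEntry[], limit = 20): string {
  return entries
    .slice() // don't mutate the caller's array
    .sort((a, b) => b.tasted_at.localeCompare(a.tasted_at)) // newest first
    .slice(0, limit)
    .map((e) => `- ${e.wine} (${e.rating}/100, ${e.tasted_at}): ${e.notes}`)
    .join("\n");
}
```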
All personalization data is protected by Supabase Row Level Security. The policy is simple: auth.uid() = id for profiles, auth.uid() = user_id for journal entries. No custom session middleware needed.
MCP: Vinny as a Service
Beyond the chat UI, Vinny exposes its wine tools via the Model Context Protocol. External AI agents (like Claude Desktop) can query the wine catalog through a JSON-RPC 2.0 endpoint:
The MCP endpoint authenticates via bearer token with constant-time comparison (crypto.timingSafeEqual), rate limits at 100 calls/day via Upstash Redis, and forces Node.js runtime for crypto API compatibility. Each request spins up a fresh McpServer instance — stateless by design for serverless deployment on Vercel.
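The constant-time comparison deserves a closer look, since `crypto.timingSafeEqual` throws when its inputs have different lengths. One common pattern (a sketch, not necessarily the project's exact code) is to hash both sides first, which guarantees equal-length buffers and avoids leaking the expected token's length via an early length check:

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Sketch of the bearer-token check. Hashing both sides yields fixed-length
// buffers, satisfying timingSafeEqual's equal-length requirement without a
// length comparison that could itself leak information.
function safeTokenEqual(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```

This is also why the route forces the Node.js runtime: `node:crypto` isn't available on Vercel's Edge runtime.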
This turns Vinny from a standalone chatbot into a reusable wine intelligence API that any agent framework can consume.
Rate Limiting: Three Tiers
Cost protection was non-negotiable. Every chat message hits the OpenAI API, so uncontrolled usage could generate real bills.
The rate limiter uses Upstash Redis with a sliding window algorithm:
- Anonymous users: 10 messages/day by IP address
- Authenticated users: 50 messages/day by user ID
- MCP clients: 100 calls/day by static key
Separate Redis prefixes prevent quota collisions between tiers. The middleware intercepts requests before they reach the chat endpoint, so rejected requests never touch OpenAI.
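To make the algorithm concrete, here is an in-memory sketch of a sliding window with per-tier key prefixes. Production uses Upstash's Redis-backed limiter; this version only illustrates the bookkeeping:

```typescript
// In-memory sketch of a sliding-window rate limiter (production uses
// Upstash Redis). Each key keeps the timestamps of recent hits; a request
// is allowed if fewer than `limit` hits fall inside the trailing window.
const WINDOW_MS = 24 * 60 * 60 * 1000; // 1 day

const hits = new Map<string, number[]>();

function allow(
  tier: "anon" | "auth" | "mcp",
  id: string,
  limit: number,
  now = Date.now()
): boolean {
  const key = `${tier}:${id}`; // prefix per tier, e.g. "anon:203.0.113.9"
  const recent = (hits.get(key) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= limit) {
    hits.set(key, recent);
    return false; // over quota: reject before touching OpenAI
  }
  recent.push(now);
  hits.set(key, recent);
  return true;
}
```

The prefix in the key is what keeps the tiers independent: the same IP can exhaust its anonymous quota while an authenticated session from that IP still has headroom.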
Data Pipeline: Idempotent Ingestion
The Kaggle dataset required careful handling. The ingestion script processes CSV records in 100-wine batches, generates embeddings via embedMany(), and upserts with a composite unique constraint on (title, taster_name). This makes re-runs fully idempotent — you can restart a failed ingestion without duplicating records.
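A sketch of that loop, with the batching helper runnable and the service calls shown as comments (they hit OpenAI and Supabase, so the exact call sites here are assumptions based on the described pattern):

```typescript
// Split records into fixed-size batches for embedding and upsert.
function batches<T>(items: T[], size = 100): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// The ingestion loop would then look roughly like:
// for (const batch of batches(wines)) {
//   const { embeddings } = await embedMany({ model, values: batch.map(embeddingText) });
//   await supabase.from("wines").upsert(
//     batch.map((w, i) => ({ ...w, embedding: embeddings[i] })),
//     { onConflict: "title,taster_name" } // re-runs overwrite, never duplicate
//   );
// }
```

The `onConflict` target is what makes restarts safe: a re-run of an already-ingested batch updates the existing rows in place instead of inserting duplicates.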
Wikipedia ingestion follows the same pattern: fetch articles via the MediaWiki API, chunk by paragraph boundaries, batch embed, and upsert with a (title, chunk_index) unique constraint.
A subtle implementation detail: the default fetch with connection keep-alive caused errors against Supabase's connection pooler during bulk upserts. A custom noKeepAliveFetch wrapper that disables keep-alive worked around the undici connection-pooling issue.
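One plausible shape for such a wrapper is shown below. This is a guess at the mechanism, not the project's actual code; the `withNoKeepAlive` helper name is introduced here for illustration:

```typescript
// Force "Connection: close" so undici doesn't reuse sockets against the
// connection pooler during bulk upserts. A sketch, not the actual fix.
function withNoKeepAlive(init: RequestInit = {}): RequestInit {
  const headers = {
    ...(init.headers as Record<string, string> | undefined),
    Connection: "close",
  };
  return { ...init, headers };
}

const noKeepAliveFetch = (input: string | URL, init?: RequestInit) =>
  fetch(input, withNoKeepAlive(init));

// supabase-js accepts a custom fetch implementation:
// createClient(url, key, { global: { fetch: noKeepAliveFetch } });
```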
Production Costs
The entire stack runs at near-zero cost:
| Service | Monthly Cost |
|---|---|
| Vercel (Hobby) | $0 |
| OpenAI (chat + embeddings) | ~$3-10 |
| Supabase (free tier) | $0 |
| Upstash Redis (free tier) | $0 |
| Total | ~$3-10 |
This isn't a demo deployment. The free tiers are production-viable because the architecture was designed around their constraints from day one: 512-dim embeddings for storage, sliding-window rate limiting for API costs, and edge-compatible middleware for compute.
What I'd Do Differently
Chunk size tuning matters more than embedding dimensions. I spent significant time optimizing vector dimensions but comparatively little on chunk boundaries. The Wikipedia paragraph-based chunking works, but variable-length chunks create inconsistent retrieval quality. Fixed-size chunks with overlap would likely improve results.
Vercel AI SDK's tool calling is excellent. The tool() abstraction with Zod schemas for parameters made adding new capabilities (web search, journal write, bottle verification) trivial. I'd adopt it earlier in future projects.
RLS is underrated for AI applications. Row Level Security eliminates an entire category of authorization bugs. The policy is one line of SQL, and it works across every query — chat context injection, journal CRUD, profile updates — without any application-level checks.
Tech Stack
- Framework: Next.js 16, React 19, TypeScript
- Database: Supabase (Postgres 15 + pgvector HNSW)
- AI: GPT-4.1-mini (chat), text-embedding-3-small (512-dim embeddings)
- AI SDK: Vercel AI v6 (
streamText,tool(), structured output) - Auth: Supabase Auth (Email + Google OAuth)
- Rate Limiting: Upstash Redis (sliding window)
- Web Search: Tavily (domain-constrained, verified URLs)
- MCP: @modelcontextprotocol/sdk (JSON-RPC 2.0 server)
- Hosting: Vercel (Hobby tier)