● Engineering deep dives

Did fully-managed RAG beat the pipeline we built?

June 27, 2026 · Jeen Lee

TL;DR — For Korean table-based government documents, our hand-built pgvector pipeline beat Amazon Bedrock's fully-managed Knowledge Base. Across 30 representative questions, ours pulled exact table-cell values; the managed KB often dropped the same tables, hurt by fixed 300-token chunking and retrieval we couldn't tune. On an Aurora instance we already run, the added RAG cost is near zero — so we kept our own pipeline.

We run a medical-advisory chatbot for school health teachers. There's one rule: answer only from official government documents — infectious-disease prevention and crisis-response manuals, school-health guidelines. If there's no basis, it doesn't make something up; it declines and hands off to a specialist professor.

So this service's quality isn't decided by the LLM's prose. It's decided by retrieval. Ask "the school-exclusion criteria for chickenpox" and it has to pull the exact cell value — "until all blisters crust over (at least 5 days after rash onset)." Miss that, and the bot goes silent.

We've been running that retrieval on a pipeline we built. Then in June 2026, Amazon Bedrock's fully-managed Knowledge Base went GA. The fixed-cost barrier was gone. So we got curious: would a RAG that AWS manages end-to-end beat the one we hand-built? Honestly, we half-expected to lose.

Hand-built pgvector pipeline vs Amazon Bedrock fully-managed Knowledge Base — A hand-built pgvector pipeline (left) vs Amazon Bedrock's fully-managed Knowledge Base (right)

Our pipeline

It's an ordinary design. That's the point.

A PDF comes in and we parse it with opendataloader — not flat text extraction, but preserving heading, paragraph, and table structure. The key step comes next: Korean structure-aware chunking. Instead of mechanically cutting every N tokens, we cut along the document's section and heading boundaries — so an "exclusion criteria" table stays in one piece with its header.

The chunks are embedded with Titan Text Embeddings v2 and stored in Aurora PostgreSQL's pgvector. HNSW index, cosine distance. On a query we pull the top-k chunks, build a context block, and hand it to Claude with a system prompt: "use only what's in the excerpts, and cite the source as [n] at the end of each sentence."

The cut strategy, the k value, which documents are on or off — all of it lives in our code. We can touch anything.

Fully-managed

The other side hands every one of those decisions to AWS. Parsing, chunking, embedding, retrieval are bundled into one box, and there's almost nothing for us to tune. In exchange, operational burden is zero.

Standing it up was the first snag. This managed KB isn't in the Seoul region — only Tokyo. So we replicated the corpus to S3 in Tokyo and built the KB there. Embedding is managed, so we didn't even need model-access requests. So far, easy.

To compare fairly, we pinned the generation step to the same Claude and the same prompt on both sides. We built one comparison endpoint and fired a single question down three paths at once: (1) our pipeline, (2) managed retrieval only, fed to the same Claude, (3) managed end-to-end RAG. We logged each path's answer, sources, and latency in one table.

The experiment

We threw 30 representative questions as single queries — fever protocols, per-disease exclusion periods, alert-level actions, reporting chains. The questions that actually come in from the field.

Then we added 8 follow-up sets to check context memory. Ask "exclusion for influenza?", and the moment it answers, follow with "and chickenpox?" The omitted "exclusion period" after "chickenpox" has to be restored from the prior turn to answer correctly.

Results

Context memory was a tie. On all 8 sets, both sides correctly filled in references like "and chickenpox?" or "and who do I report that to?" from the prior turn. Pass the conversation history properly and the model handles it.

The match was decided on tables. Our pipeline pointed straight at the cell values — oral exam "Mar–Nov," chickenpox "until crusting, at least 5 days after rash," page numbers and all. The managed side often dropped the same table — saying it "couldn't read the text." Latency averaged 5.4s vs 4.8s in the managed side's favor, but being faster doesn't help if you can't read it.

Path (3), end-to-end RAG, failed all 30. At first we wrote that off as "managed doesn't support a combined retrieve-and-generate mode." Wrong call. The docs later showed the API we called wasn't meant for managed KBs in the first place.¹ That's not a managed limitation — it's us knocking on the wrong door.

The result was clear: on Korean table documents, the hand-built side wins. But it took a few wrong turns to pin down why.

Where I was wrong

My first diagnosis: "Managed falls back to a free default parser, so it can't read tables. It isn't running OCR."

Re-reading the docs, that was wrong. This managed KB was running the smartest parser from the start. That's the default, and there's no other option.² Suspecting the parser was a dead end.

To confirm, we re-ingested the same docs with the smart parser explicitly named. 6 of 30 came back from "not found" to correct. Measles exclusion went from "can't find" to "7 days." But the chickenpox row in the very same table, right next to it, still didn't surface.

The culprit wasn't parsing — it was chunking.

Managed default chunking is a fixed ~300 tokens with 20% overlap, and there's no semantic-boundary option.³ That fixed length sliced a disease table through the middle of a row. The measles row happened to land whole in one chunk; the chickenpox row straddled a boundary and split in two. Even after fixing parsing, the chunking hidden behind it was untouchable — because we didn't have the authority to change how it cuts.

Wrong once more

"Chunking is the culprit" was only half right too. Before closing this out, we stood the KB back up and poked the retrieval directly — and the result corrected me again.

The key row was sitting in the index just fine. "Until all blisters crust over, at least 5 days after rash." It hadn't been cut away.

The problem was recall. The natural-language query "what's the isolation criteria for chickenpox?" didn't surface that row near the top. We bumped results to 20 and turned on the managed reranker,¹ but the ranking held. Reranking can't lift what recall missed — you can't reorder what isn't there.

So precisely: it's in the index, but the natural-language question can't pull it. And there's no way for us to fix that recall. Chatbot users don't stuff keywords. They just ask, "how many days off for chickenpox?"

Working on something similar?Get a free 30-min diagnosis →

Where the money leaks

On the surface, managed looks cheaper. You don't stand up a vector DB yourself, and embedding and reranking models are thrown in free.⁴ The trap is in that premise — "you don't stand up a vector DB."

We already run one. This service runs Aurora with or without RAG. pgvector just adds one column and one HNSW index on top. Because it rides on a DB that's already running, the new fixed cost RAG adds is near zero. Managed is the opposite: a dedicated RAG resource stands up, and ingest, retrieval, and storage all hit the bill as new line items. And ours was in Tokyo — so cross-region transfer piles on too.

What you pay for	pgvector (ours)	Managed KB
Vector store	Column + index on an already-running Aurora (marginal cost ≈ 0)	Dedicated managed storage (new fixed cost)
Ingest	One-off ECS task (minutes)	Billed per ingest request
Retrieval	Existing DB query + one Titan embedding	Billed per retrieval (embedding + reranking included)
Cross-region	None (single Seoul)	Tokyo cross-region transfer + corpus copy
Generation	Same Claude	Same Claude

So the comparison turns on one question: are you already running a DB? We are, so pgvector's RAG cost is nearly free. From a bare floor with no DB, the math flips — a dedicated vector store running 24/7 becomes the real burden, and per-retrieval managed pricing can be cheaper at low traffic.

A third option, OpenSearch

There's one candidate we deliberately left out: OpenSearch. A middle ground that keeps the KB's parsing and chunking but lets you control retrieval directly. The appeal is clear — it targets exactly the recall problem we hit. It blends BM25 keyword scores with kNN vector scores, and adding the Korean morphological analyzer nori sharpens keyword matching for words like "chickenpox" or "exclusion."

The problem is money. OpenSearch Serverless keeps a minimum unit running even at zero traffic, so hundreds of dollars a month sit on the floor.⁵ For our two-document, low-traffic project, that's overkill. When the corpus grows and natural-language recall accuracy becomes business-critical, OpenSearch hybrid can be the answer. Not for us right now.

The decision

We kept the pipeline in production.

Managed was smart. It just didn't give us the control we needed. To preserve a Korean table intact, we have to decide where it cuts — and we can't. The recall of key rows from natural-language questions is shaky, and there's no way to touch it. On top of that it's pinned to Tokyo, and storage and query fees leak even when idle.

What's left

RAG quality is parsing × chunking × retrieval. A product, not a sum — if one term nears zero, the rest can't save it. Here too, the moment we fixed parsing, chunking popped up as the next zero; and when we thought it was chunking, recall tripped us again.

"It's fully managed, so it must be optimal" missed. Managed converges on safe defaults and hides the fine-tuning options. For English prose those defaults might have been enough. In front of Korean government documents built from tables, that became the weakness.

Last, we left the wrong diagnoses in, not deleted. Reversing the "parser problem" call twice is, we think, the most useful record this comparison produced.

Designing and validating RAG for Korean, regulated domains like this is part of what we do in AX Consulting and Data & ML Engineering. The point is making "automate it with AI" survive in production.

References

AWS behavior and constraints above were verified against the official docs below. Comparison numbers (30 single-turn, 8 follow-up sets, measured retrieve) are from our own corpus. Verified: 2026-06.

Amazon Bedrock — Retrieving information from data sources. Retrieve vs RetrieveAndGenerate; Retrieve exposes reranking and result-count options. ↩ ↩²
Amazon Bedrock API — ParsingConfiguration · Parsing options. Managed KB allows Smart Parsing only. ↩
Amazon Bedrock — Customize ingestion for a managed knowledge base. Default chunking 300 tokens / 20% overlap, no semantic chunking. ↩
Amazon Bedrock — Build a managed knowledge base. Managed embedding and reranking included free. ↩
AWS Solutions — QnABot on AWS, Cost. Example monthly cost for OpenSearch-based RAG. ↩