15.04.26 · 8 MIN READ

Andrej Karpathy Just Killed My RAG Pipeline.

On April 3, 2026, Karpathy posted a gist with no new framework, no benchmarks, and no infra. It went to 16 million views on X and 5,000 stars because it quietly showed that the RAG stack most of us were paying for was never needed at our scale.

The gist is titled "LLM Wiki". The linked X post has 16 million views as of this writing; the gist itself has 5,000 stars and 485 comments. It contains no new framework, no benchmark table, no repo. What it describes is a folder of markdown files that an LLM keeps up to date.

Andrej Karpathy on X introducing the LLM Wiki pattern.
Karpathy's original X post, April 3, 2026.

The post killed my RAG pipeline in an afternoon.

I want to be honest about what that sentence means. I am not claiming RAG is dead. I am claiming that the RAG pipeline I was running against a 90,000-word research corpus, with an embedding model, a vector store, a chunker, and a LangChain-shaped retrieval layer, was strictly worse on cost, maintenance, and answer quality than pointing Claude Code at a folder of markdown. That is the claim. The rest of this post defends it.

What Karpathy Actually Proposed

The gist is three things. A raw/ folder, immutable, that holds source documents the LLM reads but never modifies. A wiki/ folder, mutable, that holds markdown articles the LLM writes and maintains, including an index.md. And a CLAUDE.md at the root that tells the agent how the vault is organized and how to update it.
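
On disk, that is nothing more exotic than this. The three top-level pieces are straight from the gist; the example files underneath are my own illustration, not Karpathy's:

vault layout
vault/
├── CLAUDE.md               # root instructions: how the vault is organized, how to update it
├── raw/                    # immutable; the LLM reads these, never edits them
│   ├── paper-attention.pdf
│   └── blog-scaling.md
└── wiki/                   # mutable; the LLM writes and maintains these
    ├── index.md            # the table of contents the agent reads first
    ├── attention.md
    └── scaling-laws.md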

That is the whole system. The LLM reads new material dropped into raw/, writes summary articles into wiki/, and back-links across existing articles. When you ask a question, the agent reads the index first, opens only the wiki articles it needs, and answers from those. No embeddings. No chunker. No similarity search.

I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files.

That is the line people keep quoting from his tweet, and it is the whole thesis. The vendor stack most teams built between 2022 and 2024 assumed the LLM could not be trusted to find the right material on its own. That assumption aged badly.

The Scale Where This Works

Karpathy explicitly scopes the pattern to individual researchers. He describes his own wiki as "~100 sources, ~hundreds of pages." Third-party analyses converge on ~50,000 to ~100,000 tokens of compiled wiki content as the break-even point where the pattern starts to lose to traditional RAG. Below that, the LLM reads the index and grabs three or four articles per query. Above that, context-window cost and attention fall-off start to matter again.

scale comparison
# LLM Wiki (raw/ + wiki/ + CLAUDE.md)
reads per query         = index.md + 2–4 wiki articles
embedding infra         = none
vector store            = none
chunker / re-embed      = none
token cost per query    = 1x (read only relevant articles)
maintenance             = re-run LLM lint pass
works well up to        = ~100k tokens of compiled wiki content

# Traditional RAG (embed + vector + retrieve)
reads per query         = top-k chunks via similarity
embedding infra         = embedding model + vector DB + pipeline
vector store            = Pinecone / Weaviate / Chroma / pgvector
chunker / re-embed      = required on every source change
token cost per query    = top-k × chunk size (often overshoots)
maintenance             = re-chunk, re-embed, tune k
works well above        = millions of documents with stable schema

A MindStudio analysis published around the same time benchmarked token usage for a small-scale knowledge base, and measured up to a 95% reduction against naive full-document loading. That is not a claim about RAG specifically. It is a claim about "read the index, open the right article, answer" beating "stuff everything in context." But the shape of the result matches what I saw on my own corpus, which is the part I care about.
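
The back-of-envelope on my own corpus, using the usual ~1.3 tokens per English word and assuming roughly 1k tokens for the index and 1.5k per wiki article (my estimates, not MindStudio's measurements):

token math, my corpus
full-document load   ≈ 90,000 words × 1.3      ≈ 117k tokens per query
wiki query           ≈ index 1k + 3 × 1.5k     ≈ 5.5k tokens per query
reduction            ≈ 1 − 5.5k / 117k         ≈ 95%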

Why the Pattern Works Now and Did Not in 2023

Three things had to land before this pattern could beat RAG.

  • 01Context window growth. The GPT-3.5-turbo API shipped with a 4K-token context in March 2023. Claude 3 shipped with 200K in March 2024. Current Claude and Gemini models run 200K to 1M token contexts. Read-the-whole-relevant-article became tractable.
  • 02Tool-using coding agents. Claude Code, Codex CLI, and equivalent shells made read_file, grep, and glob first-class. The agent can walk a directory the way a human would, which is what the wiki pattern assumes.
  • 03Long-horizon instruction-following. Claude Sonnet 4.5 and Opus 4.x hold a multi-step schema across dozens of file edits without drifting. Maintaining an index.md and back-links across 100 articles is now a task the model handles without a framework.

If you tried this exact pattern in 2023, you would have gotten a confused wiki with broken back-links and duplicate articles. That is not a criticism of 2023. It is the reason the RAG infrastructure stack got built in the first place. It was the correct answer to the model capability that existed then.

Where RAG Still Wins

Karpathy says this himself and it is worth repeating without softening. Above a million documents, when the wiki itself would not fit in context, traditional retrieval is still the answer. Same for any workload where documents change rapidly, where different users need access-controlled subsets, or where the corpus is adversarial and the index can be poisoned. Those cases pay for the vector DB for real reasons.

The honest size-by-audience table, with a rough decision sketch after it:

  • 01Personal research vault at up to a few hundred thousand words: wiki pattern.
  • 02Team knowledge base, static, under ten thousand documents: wiki pattern with a scheduled lint pass.
  • 03Product search over a changing product catalog: traditional RAG.
  • 04Enterprise document store with ACLs and millions of documents: traditional RAG, often with hybrid BM25 plus embedding.
  • 05Customer-facing chatbot that cannot leak across tenants: traditional RAG with tenant-scoped indexes.
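
If you want that table as a guardrail rather than prose, here is a rough sketch in Python. The thresholds are lifted from the table and the gist's own scoping, not from any benchmark, and the function and parameter names are mine:

decision sketch
def choose_stack(
    wiki_tokens: int,   # size of the compiled wiki, in tokens
    doc_count: int,     # number of source documents
    rapid_churn: bool,  # sources change faster than a lint pass can track
    needs_acl: bool,    # per-user or per-tenant access control
) -> str:
    """Rough decision rule from the size-by-audience table above."""
    if needs_acl or rapid_churn:
        return "traditional RAG"   # tenant-scoped indexes, fast-moving corpora
    if wiki_tokens <= 100_000 and doc_count <= 10_000:
        return "wiki pattern"      # index + a few articles per query
    if doc_count >= 1_000_000:
        return "traditional RAG"   # the wiki itself would never fit in context
    return "measure both"          # the gray zone in between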

The point of Karpathy's post is not that RAG is over. It is that the lower bound where RAG starts paying for itself moved up, and a lot of teams have been paying on the wrong side of that line.

What I Actually Migrated

Obsidian graph view of the LLM-maintained wiki: nodes for concepts, tools, people, and sources connected by back-links.
The same wiki in Obsidian's graph view. Every node is a markdown file the LLM wrote; every edge is a back-link it maintained.

I ran a personal research vault on roughly 90,000 words across 60-ish documents. It was chunked at 512 tokens with 128-token overlap, embedded with text-embedding-3-small, stored in a small pgvector table, and retrieved top-5. The retrieval was fine. The answers were fine. The ongoing cost was the part that bothered me: re-embedding on every source update and maintaining the chunker carried real ops overhead.
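
For the record, the query path I deleted looked roughly like this. It is a sketch from memory, not my exact code; it assumes psycopg 3 and a pgvector table chunks(body text, embedding vector(1536)), the dimension text-embedding-3-small produces:

rag query path (deleted)
from openai import OpenAI
import psycopg

client = OpenAI()

def retrieve(question: str, k: int = 5) -> list[str]:
    # Embed the query with the same model the chunks were embedded with.
    q = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    with psycopg.connect("dbname=vault") as conn:
        rows = conn.execute(
            # pgvector's <-> is L2 distance; smallest distance first.
            "SELECT body FROM chunks ORDER BY embedding <-> %s::vector LIMIT %s",
            (str(q), k),
        ).fetchall()
    return [row[0] for row in rows]

Every edit to a source document meant re-chunking and re-embedding upstream of this function, which is exactly the overhead the wiki pattern deletes.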

I deleted the embedding pipeline and the vector table. I moved the source documents to raw/. I pointed Claude Code at the folder with a CLAUDE.md describing the vault and asked it to generate wiki/ from scratch. It took about fourteen minutes on the first ingest and produced sixty-three wiki articles plus an index. Query cost dropped to what Claude Code normally costs when it opens three or four files to answer a question. Answer quality improved because the wiki articles resolve their own ambiguity with back-links, which ranked chunks never did.
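
The first ingest was a single headless run of Claude Code; claude -p is its non-interactive print mode, and the prompt wording here is paraphrased from memory:

first ingest
$ cd vault/
$ claude -p "Read CLAUDE.md, then build wiki/ from everything in raw/ and write index.md."

That one command is the entire replacement for the ingestion pipeline.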

I kept a branch with the RAG pipeline for two weeks to A/B test on my own queries. I have not switched back.

Who Should Care, and Why

If you run a personal research vault, a team wiki, or a founder's "second brain," this pattern is strictly better on cost and setup. You should migrate before you ship another RAG feature.

If you run a RAG product, read the gist carefully and figure out where on the scale curve your users actually sit. If most of your customers are below the hundred-thousand-token line and are paying your infra bill, you have a product-positioning problem you are about to find out about from someone on X.

If you build agent infra, the lesson is that tool-using agents plus long context eat infrastructure that assumed the model was weak. That pattern is going to repeat. Re-rankers, routers, query planners, anything whose purpose was to compensate for a limitation that no longer exists. Price accordingly.

How the Pieces Fit

The RAG industrial complex built the right stack for 2023 model capability. It kept shipping through 2024 and 2025, and vendors kept adding features while the models underneath got enough better that the features stopped being load-bearing. Karpathy's gist is not a new idea. The idea is as old as wiki software itself. What is new is that the LLM can now run the wiki.

The reason the post went to 16 million views on a platform that mostly ignores infrastructure takes is that a lot of people were already feeling this and needed someone credible to say it out loud. My RAG pipeline is deleted. My wiki is in a folder. The model is doing the job the vector database was doing, and doing it better. That is the verdict.

References