Building in Public
Building a 1.67-Million-Page Digital Library with AI
How I used Gemini, Lambda workers, and MongoDB to OCR and translate 4,430 books in 30 languages — for about $3,400 total.
27 February 2026 · 12 min read

There are roughly 533,000 Latin editions printed between 1450 and 1700 catalogued in the Universal Short Title Catalogue. About 18% have been digitized. About 8% have searchable text. About 3% have an English translation of any kind. The rest — hundreds of thousands of books containing five centuries of European thought — are effectively invisible to anyone who doesn’t read Renaissance Latin.
In December 2025, I started building Source Library to change that. Less than three months later, the system has processed 4,430 books containing 1.67 million page images from 15 institutional archives, in over 30 languages. 467,000 pages have been OCR’d. 285,000 pages have been translated into English. Total AI processing cost: roughly $3,400.
This is the technical story of how that works.
The Pipeline
Every book flows through a nine-stage pipeline, each stage driven by an independent cron job. The stages are idempotent — you can re-run any stage without corrupting data — and each book transitions through 14 states from queued to complete, with automatic retry logic (three attempts before marking failed).
The stages, in order:
- Import — Pull metadata and page images from 13 different archive APIs (Internet Archive, Gallica, Bavarian State Library, Vatican, Bodleian, and others)
- Archive — Cache page images to Vercel Blob storage
- OCR — Gemini vision models extract text, detect images, classify page type
- Metadata enrichment — Language detection, category assignment, descriptions
- Translation — Gemini translates with context chaining from previous pages
- Summary & index — Book-level summaries, entity extraction (people, places, concepts)
- Chapters — Structural division detection
- Image extraction — Lambda workers identify and catalog illustrations with museum-quality metadata
- Cross-referencing — 129,000+ entities linked across the collection
The entire thing runs unattended. I check the dashboard a few times a week. Most of my time goes into adding new archive sources and refining prompts.
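The idempotent-stage and retry behaviour described above can be sketched in a few lines. This is a minimal illustration, not the real implementation: the `status`/`attempts` fields and status naming are assumptions, not the actual schema.

```python
# Minimal sketch of one pipeline stage: skip if already done (idempotent),
# retry up to three times, then mark the book failed.

MAX_ATTEMPTS = 3

def run_stage(book: dict, stage_fn) -> dict:
    """Run one pipeline stage; no-op if finished, fail after three attempts."""
    done_status = f"{stage_fn.__name__}_done"
    if book.get("status") in (done_status, "failed"):
        return book  # re-running a finished stage never corrupts data
    try:
        stage_fn(book)
        book["status"] = done_status
        book["attempts"] = 0
    except Exception:
        book["attempts"] = book.get("attempts", 0) + 1
        if book["attempts"] >= MAX_ATTEMPTS:
            book["status"] = "failed"
    return book
```

Because every stage checks its own completion marker before doing work, a cron job can simply re-invoke all stages on every tick and let the state machine sort out what still needs doing.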
Why Gemini
OCR on Renaissance-era books is not a solved problem. These texts have long-s characters, ligatures, mixed Latin and Greek, blackletter typefaces, marginalia in different hands, and woodcut illustrations interrupting the text flow. Traditional OCR engines like Tesseract struggle badly.
Vision-language models handle this surprisingly well. You give Gemini a page image and a prompt that says “transcribe this text exactly as written, noting any images or diagrams,” and it returns clean text with remarkably few errors. I maintain language-specific prompts — a standard prompt, a Latin prompt that handles Neo-Latin abbreviations, and a German prompt for Fraktur and Kurrent scripts.
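The prompt routing is simple enough to show in full. The prompt texts and language codes below are placeholders standing in for the production prompts, which are longer and version-controlled.

```python
# Illustrative per-language prompt selection; the actual prompts are
# immutable, versioned documents, not these one-liners.

OCR_PROMPTS = {
    "default": "Transcribe this text exactly as written, noting any images or diagrams.",
    "la": "Transcribe exactly, expanding Neo-Latin abbreviations and ligatures.",
    "de": "Transcribe exactly, reading Fraktur and Kurrent letterforms.",
}

def ocr_prompt_for(language: str) -> str:
    """Return the language-specific OCR prompt, falling back to the standard one."""
    return OCR_PROMPTS.get(language, OCR_PROMPTS["default"])
```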
The cost structure made Gemini the clear choice. The Batch API runs at 50% of realtime pricing, bringing per-page OCR cost to about $0.001. Translation is similar. For a typical 300-page book, the full pipeline — OCR, translation, summary, image extraction — costs about $1.20.
The Dual Processing Architecture
The system runs two parallel processing paths. The realtime path uses AWS Lambda workers with SQS queues — three workers for OCR, translation, and image extraction. Pages process in seconds, at full Gemini API prices. This is for priority books or when someone’s waiting on a specific title.
The batch path uses Gemini’s Batch API. I submit jobs of up to 500 pages, they process within 24 hours (usually much faster), and a cron job collects results every two hours. This is 50% cheaper and handles the bulk of throughput.
For books longer than 500 pages, the batch system uses a parent-child architecture — the parent job splits the book into chunks, each child processes independently, and results are reassembled.
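The parent job's chunking step amounts to carving the page count into ranges of at most 500. A sketch, with hypothetical names:

```python
# Split a long book into page ranges of at most BATCH_LIMIT pages; each
# range becomes a child batch job, and results are reassembled in order.

BATCH_LIMIT = 500

def chunk_pages(page_count: int, limit: int = BATCH_LIMIT):
    """Return inclusive, 1-based (start, end) ranges of at most `limit` pages."""
    return [
        (start, min(start + limit - 1, page_count))
        for start in range(1, page_count + 1, limit)
    ]
```

A 1,200-page folio yields three child jobs: pages 1–500, 501–1000, and 1001–1200.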
The MongoDB Connection Storm
The hardest scaling problem wasn’t AI — it was the database. When 600+ Lambda workers are processing pages concurrently, each one opening a MongoDB Atlas connection, writing results, and closing the connection, you get connection storms that exhaust the cluster’s connection pool.
The fix was a dedicated Writer Lambda. Instead of each worker writing directly to MongoDB, they all send results to an SQS write queue. A single Writer Lambda (capped at 50 concurrent executions) reads from the queue, batches writes in groups of 10, and handles all database operations. Connection count dropped from “unbounded” to “50, max.”
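The Writer Lambda's drain loop looks roughly like this, with the SQS consumption and the MongoDB bulk write abstracted behind plain callables. Function names are hypothetical.

```python
# Sketch of the Writer Lambda: read queued results, commit them to the
# database in groups of 10 so only the writer ever holds a connection.

from itertools import islice

BATCH_SIZE = 10

def batched(iterable, size=BATCH_SIZE):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def drain_write_queue(messages, write_many) -> int:
    """Commit queued results in groups of 10; return the number written."""
    written = 0
    for group in batched(messages):
        write_many(group)  # in production: a bulk write against MongoDB
        written += len(group)
    return written
```

With the queue in front, worker concurrency can scale freely while database connections stay bounded by the writer's concurrency cap.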
When the AI Hallucinates
With 467,000 OCR’d pages, I’ve catalogued the failure modes. Two are distinctive:
Space Flood: The model emits legitimate headers, then fills the rest of the output with spaces and dashes. I’ve seen outputs with 1.7 million characters of whitespace. The P99 for legitimate OCR output is about 9,000 characters — anything over 25,000 is almost certainly a hallucination.
Thinking Leak: The model’s internal reasoning loop bleeds into the output, creating repetitive self-talk cycles. This is rare but unmistakable when it happens.
The mitigation is straightforward: realtime processing uses a fallback chain (gemini-2.5-flash, then gemini-2.0-flash, then gemini-1.5-flash) when safety filters trigger, and all hallucinations are logged to a dedicated gemini_usage collection for analysis. Every prompt version is immutable in the database with auto-incrementing IDs, so every page’s output is traceable to the exact prompt that produced it.
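The space-flood check and fallback chain can be sketched as follows. The length threshold mirrors the figures above (P99 around 9,000 characters; over 25,000 suspect); the whitespace-ratio cutoff is an illustrative assumption, not the production value.

```python
# Heuristic hallucination check plus model fallback chain.

HARD_LIMIT = 25_000
FALLBACK_CHAIN = ["gemini-2.5-flash", "gemini-2.0-flash", "gemini-1.5-flash"]

def looks_like_space_flood(text: str) -> bool:
    """Flag outputs that are implausibly long or dominated by spaces/dashes."""
    if len(text) > HARD_LIMIT:
        return True
    filler = sum(text.count(c) for c in " -\n\t")
    return len(text) > 0 and filler / len(text) > 0.9

def ocr_with_fallback(page, call_model):
    """Try each model in turn until one returns plausible text."""
    for model in FALLBACK_CHAIN:
        text = call_model(model, page)
        if text and not looks_like_space_flood(text):
            return model, text
    return None, None  # every model failed; page goes back for retry
```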
Multi-Column Detection
Many historical books have two-page spreads scanned as a single image, or multi-column layouts within a single page. Getting the reading order right requires detecting the gutter position between columns.
The system tries four approaches in sequence: dark line detection (physical gutter shadow), bright line detection (margin whitespace), center split (when geometry is obvious), and finally Gemini AI analysis (when the layout is genuinely ambiguous). Virtual crop coordinates on a 0–1000 scale allow non-destructive splitting — the original page image is never modified.
OCR prompts output <column-break/> markers to preserve reading order within multi-column pages. This sounds like a detail, but getting it wrong means sentences from column A interleave with sentences from column B, rendering the entire page nonsensical.
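Both halves of the scheme, the virtual crop and the marker split, are small enough to sketch. Function names here are illustrative.

```python
# Map a gutter position on the 0-1000 virtual scale to pixel boxes
# (non-destructive: the source image is never touched), and split OCR
# output back into columns on the <column-break/> marker.

def virtual_to_pixel_boxes(gutter_x: int, width: int, height: int):
    """Return (left, top, right, bottom) pixel boxes for each side of the gutter."""
    px = gutter_x * width // 1000
    return (0, 0, px, height), (px, 0, width, height)

def split_columns(ocr_text: str):
    """Recover reading-order column text from <column-break/> markers."""
    return [part.strip() for part in ocr_text.split("<column-break/>")]
```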
What’s in the Library
The core collection comes from the Bibliotheca Philosophica Hermetica (Embassy of the Free Mind) in Amsterdam — one of the world’s most important collections of Hermetic, alchemical, Kabbalistic, and Rosicrucian literature. But the scope extends far beyond esotericism: Greek philosophy, Roman law, Islamic science, Chinese statecraft, Indian mathematics, Reformation theology, early modern astronomy and medicine.
The collection spans 30+ languages. The top five: Latin (975 books), English (828), Chinese (533), German (524), and Greek (428). The oldest texts date to antiquity; the newest to the 18th century.
71,000 illustrations have been catalogued with AI-generated descriptions. The image extraction pipeline identifies woodcuts, diagrams, tables, and decorative elements, tagging each with subject matter, style, and context.
The MCP Server
Source Library publishes an MCP server (Model Context Protocol, v4.0) with seven tools that let any AI system search and read the library programmatically. You can search across titles, authors, translations, and OCR text; find what historical authors wrote about specific topics; read 50+ pages of text in a single call; and query the illustration catalogue.
Every tool returns structured data with citation URLs — if an AI finds a passage, it can link directly to the page on sourcelibrary.org where that passage appears. The library isn’t just a website; it’s a knowledge layer that other systems can build on.
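The shape of such a result can be illustrated like this. The URL path is a made-up example, not the real sourcelibrary.org route scheme, and the field names are assumptions rather than the actual tool schema.

```python
# Illustrative shape of a search-tool result carrying a citation URL.

def tool_result(book_id: str, page: int, snippet: str) -> dict:
    """Bundle a matched passage with a direct link back to its page."""
    return {
        "snippet": snippet,
        "citation_url": f"https://sourcelibrary.org/book/{book_id}/page/{page}",
    }
```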
Cost and Scale
Some numbers that still surprise me:
- Total AI processing cost to date: ~$3,400 for 2M+ Gemini API calls
- Per-book pipeline cost (batch): ~$1.20 for a 300-page book
- At $1.20 per 300-page book, a $40K API budget would cover roughly 33,000 books through the full pipeline
- A single human translator might produce 5–10 pages of scholarly translation per day. This system processes thousands of pages per hour.
The translations are not publication-grade. They’re research-grade — accurate enough for discovery, for finding the passage you need, for understanding argument structure, for deciding whether a book warrants serious scholarly attention. The difference matters. This isn’t replacing translators; it’s building the index that tells translators where to look.
Why Now
Three conditions are simultaneously true right now, and may not be for long:
- AI translation is good enough for research-grade discovery
- Source material is digitized — 400,000+ pre-modern texts have been photographed by archives worldwide
- Scholarly expertise still exists — but specialists in Latin, Syriac, and Classical Chinese are retiring faster than they’re being replaced
The window where we have both the technology to process these texts and the scholars to validate the output is narrowing. Source Library is a bet that building the infrastructure now, while both exist, is worth the effort.
How It Was Built
Source Library was built almost entirely through Claude Code — 371 sessions and over 51,000 prompts from December 2025 through February 2026. The prompts read like a project diary:
> fully automated workflow for book import to split
> (if necessary) to ocr to translate to summary to
> publication review to doi
# Dec 27 — scaling ambitions
> we will have a billion books with hundreds of pages
# Dec 27 — curating by candlelight
> demonologie by king james
# Dec 30 — going deep
> oh, we need the bible of course... I guess we should
> find an early greek bible? Hebrew? First in Latin?
# Dec 31 — chasing fragments
> Any byzantine sources with orphic histories? I bet there
> are a bunch of unread greek sources that make reference
# Jan 1, 2026 — milestone
> let’s get to 1000 books in the collection
# Jan 2 — scolding the AI
> “What AI can’t do: Actually understand”???
> what are you on... you can do a lot better than this.
> Do you need to meditate or recenter?
# Feb 16 — launch anxiety
> I want to put out a press release. So that we don’t
> get scooped. I dont’ know how to do press releases
> or any of this stuff
# Feb 27 — the patron
> I think the /letter to patrick collison needs more
> fact checking
The “billion books” prompt was typed two days after Christmas. The demonology, the Orphic fragments, the early Bibles — the collection grew the way any library grows: by following curiosity at midnight. Most of the system emerged the same way — a problem noticed, a prompt typed, a solution built in the same session.
You can watch a compressed replay of a Source Library building session at codevibing.com.
The library is live at sourcelibrary.org. Everything is open. The MCP server is on npm. If you read Latin, we could use your eyes.