Trustworthy, Affordable Translation of Books and Historical Manuscripts at Scale

Translation & Heritage AI

Millions of digitized books and manuscripts are readable as images but inaccessible as meaning — locked behind old scripts, dead and low-resource languages, and OCR that quietly corrupts the text before anyone translates it. LLMs can now translate book-length text well enough to be useful but not reliably enough to trust blindly, and the most accurate models are the most expensive — so cost and accuracy pull against each other. This research asks how to turn scanned books and historical manuscripts into accurate, citation-traceable translations at a cost that makes translating whole libraries — not just single pages — actually feasible.

Multi-Agent · Machine Translation · Historical Documents · Cultural Heritage · Cost-Accuracy Tradeoff · Provenance · Evaluation

The crisis

Libraries have digitized millions of books and manuscripts, but a scanned page is only an image — for most people it remains unreadable when it is in an old script or a language they do not speak.
Professional book translation is slow and costly, and for historical and low-resource languages there are few qualified translators at all, so the vast majority of this written heritage is never translated.
OCR of old print and handwriting introduces errors that compound through translation: without a correction step, a fluent and confident-sounding translation can be silently wrong.
LLM translation quality is uneven across languages, and the most accurate models cost the most per page, so translating an entire book forces a direct tradeoff among accuracy, cost, and carbon footprint.
As AI increasingly mediates access to the written record, accurate, traceable, and affordable translation is what decides whether this knowledge becomes open or stays locked away.

About this research

Most of the world's digitized written heritage is unreadable to the people who would benefit from it: it sits in old scripts and low-resource languages, and the step that turns a scanned page into text can quietly introduce errors that a fluent-sounding translation then hides. This thread investigates how translation of books and historical manuscripts can be made accurate, traceable, and affordable enough to apply to whole collections rather than single pages, treating cost per accurate page as a first-class concern rather than an afterthought. It is framed as decision support for scholars and readers, not an autonomous translator: every claim should trace back to a source page, and a human keeps final say on contested output. The work draws on agentic LLM architectures, document-level machine translation, historical-document recognition, and rigorous translation evaluation. Faculty-advised.
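To make "cost per accurate page" concrete, here is a minimal sketch of one way such a metric could be computed. The source does not define the metric, so everything here is an assumption for illustration: the `PageResult` record, the idea that each page gets a pass/fail accuracy judgment, and the dollar figures, which are made-up numbers rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    # Hypothetical record: total spend on one page (OCR, correction, translation).
    cost_usd: float
    # Hypothetical judgment: did the page pass a human or benchmark accuracy check?
    accurate: bool


def cost_per_accurate_page(pages: list[PageResult]) -> float:
    """Total spend divided by the number of pages judged accurate.

    Pages that fail the check still cost money, so they raise the figure.
    Returns infinity when no page passes.
    """
    total_cost = sum(p.cost_usd for p in pages)
    accurate_pages = sum(1 for p in pages if p.accurate)
    if accurate_pages == 0:
        return float("inf")
    return total_cost / accurate_pages


# Illustrative numbers only: a cheap model that gets 6 of 10 pages right,
# and a stronger model that gets 9 of 10 right at five times the price.
cheap = [PageResult(cost_usd=0.01, accurate=(i < 6)) for i in range(10)]
strong = [PageResult(cost_usd=0.05, accurate=(i < 9)) for i in range(10)]

print(f"cheap:  {cost_per_accurate_page(cheap):.4f} USD per accurate page")
print(f"strong: {cost_per_accurate_page(strong):.4f} USD per accurate page")
```

With these invented inputs the cheaper model comes out ahead on this one number even though it gets fewer pages right, which is why a metric like this would likely need to be read alongside an accuracy floor rather than on its own.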

OPEN — accepting collaborators

Rolling basis

Open roles

Research Lead / Architect
Own the research question and the system end to end, including the cost-versus-accuracy tradeoff and the source-grounded, human-gated translation flow. Good if you like turning a working prototype into a real system and a publishable result.
Research Engineer — OCR & Document Pipeline
Build the early stages that turn a scanned page into clean, structured text, comparing approaches and attaching a per-page confidence score. Good if you like getting messy real-world documents to parse cleanly.
ML / Evaluation Engineer
Design the benchmark that informs how each page is handled: measure recognition error, translation quality, and cost per accurate page. Show whether the system is cheaper without being less accurate.
Computational Linguist / Historical-Languages Contributor
Build per-book glossaries and validate translation quality across historical and low-resource languages, including old-print quirks. Define where model correction helps versus hurts per language, and curate ground-truth pages. Good if you care about doing justice to the source text.

Want to know more about this work?

Kavach (Slack: @kms22)
Varanjot (Slack: @vsdhillon018)

Or reach the lab → contact@anacodic.ai

Faculty advisor

Prof. Eugene Pinsky
