Trustworthy, Affordable Translation of Books and Historical Manuscripts at Scale

Translation & Heritage AI

Millions of digitized books and manuscripts are readable as images but inaccessible as meaning: they are locked behind old scripts, dead and low-resource languages, and OCR that quietly corrupts the text before anyone translates it. LLMs can now translate book-length text well enough to be useful but not reliably enough to trust blindly, and the most accurate models are the most expensive, so cost and accuracy pull against each other. This research asks how to turn scanned books and historical manuscripts into accurate, citation-traceable translations at a cost that makes translating whole libraries, not just single pages, feasible.

Multi-Agent, Machine Translation, Historical Documents, Cultural Heritage, Cost-Accuracy Tradeoff, Provenance, Evaluation

The crisis

  • Libraries have digitized millions of books and manuscripts, but a scanned page is only an image: for most people it remains unreadable when it is in an old script or a language they do not speak.
  • Professional book translation is slow and costly, and for historical and low-resource languages there are few qualified translators at all, so the vast majority of this written heritage is never translated.
  • OCR of old print and handwriting introduces errors that compound through translation: without a correction step, a fluent and confident-sounding translation can be silently wrong.
  • LLM translation quality is uneven across languages, and the most accurate models cost the most per page, so translating an entire book forces a direct tradeoff between accuracy, cost, and carbon footprint.
  • As AI increasingly mediates access to the written record, accurate, traceable, and affordable translation is what decides whether this knowledge becomes open or stays locked away.

About this research

Most of the world's digitized written heritage is unreadable to the people who would benefit from it: it sits in old scripts and low-resource languages, and the step that turns a scanned page into text can quietly introduce errors that a fluent-sounding translation then hides. This thread investigates how translation of books and historical manuscripts can be made accurate, traceable, and affordable enough to apply to whole collections rather than single pages, treating cost per accurate page as a first-class concern rather than an afterthought. It is framed as decision support for scholars and readers, not an autonomous translator: every claim should trace back to a source page, and a human keeps final say on contested output. The work draws on agentic LLM architectures, document-level machine translation, historical-document recognition, and rigorous translation evaluation. Faculty-advised.
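
One way to make "cost per accurate page" concrete is sketched below. This is a minimal illustration under our own assumptions, not a metric the research defines: it takes the cost of translating one page and divides it by the fraction of pages an evaluation judges accurate. The names `PipelineRun` and `cost_per_accurate_page`, and all figures in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRun:
    """One hypothetical translation setup evaluated on a sample of pages."""
    name: str
    cost_per_page: float      # assumed spend per translated page, e.g. in USD
    accurate_fraction: float  # assumed share of pages judged accurate, in (0, 1]


def cost_per_accurate_page(run: PipelineRun) -> float:
    """Spend needed, on average, to obtain one page judged accurate.

    Assumed definition: cost per page divided by the fraction of
    pages that pass evaluation.
    """
    if not 0.0 < run.accurate_fraction <= 1.0:
        raise ValueError("accurate_fraction must be in (0, 1]")
    if run.cost_per_page < 0.0:
        raise ValueError("cost_per_page must be non-negative")
    return run.cost_per_page / run.accurate_fraction


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements.
    runs = [
        PipelineRun("larger-model", cost_per_page=0.40, accurate_fraction=0.95),
        PipelineRun("smaller-model", cost_per_page=0.05, accurate_fraction=0.60),
    ]
    # Rank setups from cheapest to most expensive per accurate page.
    for run in sorted(runs, key=cost_per_accurate_page):
        print(f"{run.name}: {cost_per_accurate_page(run):.3f} per accurate page")
```

A single ratio like this hides which pages fail and how badly, so it would sit alongside, not replace, page-level provenance and human review of contested output.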