Translation & Heritage AI
Millions of digitized books and manuscripts are readable as images but inaccessible as meaning: they are locked behind old scripts, dead and low-resource languages, and OCR that quietly corrupts the text before anyone translates it. LLMs can now translate book-length text well enough to be useful but not reliably enough to trust blindly, and the most accurate models are also the most expensive, so cost and accuracy pull against each other. This research asks how to turn scanned books and historical manuscripts into accurate, citation-traceable translations at a cost that makes translating whole libraries, not just single pages, feasible.
Most of the world's digitized written heritage is unreadable to the people who would benefit from it. It sits in old scripts and low-resource languages, and the recognition step that turns a scanned page into text can quietly introduce errors that a fluent-sounding translation then hides. This thread investigates how translation of books and historical manuscripts can be made accurate, traceable, and affordable enough to apply to whole collections rather than single pages, treating cost per accurate page as a first-class concern rather than an afterthought. It is framed as decision support for scholars and readers, not as an autonomous translator: every claim should trace back to a source page, and a human keeps the final say on contested output. The work draws on agentic LLM architectures, document-level machine translation, historical-document recognition, and rigorous translation evaluation. The work is faculty-advised.