
Melissa Dell: “Efficient OCR for Building a Diverse Digital History”

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR), which jointly learns a vision and language model, is poorly extensible to low-resource document collections, because learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters’ visual features, it is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.
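As a rough illustration of what "OCR as character-level image retrieval" can mean, the toy sketch below labels each character crop with the nearest entry in a reference index of character embeddings. It is not the speaker's implementation: the three-dimensional vectors, `REFERENCE_INDEX`, and the `recognize` helper are all hypothetical stand-ins, and a real system would obtain the embeddings from the contrastively trained vision encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recognize(crop_embedding, index):
    """Return the label whose reference embedding is most similar to the crop."""
    return max(index, key=lambda label: cosine(crop_embedding, index[label]))


# Hypothetical reference index: one made-up embedding per known character.
REFERENCE_INDEX = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.1, 0.9, 0.1],
    "c": [0.0, 0.2, 0.9],
}

# Hypothetical embeddings of three character crops from one scanned line.
line_crops = [[0.8, 0.2, 0.1], [0.1, 0.1, 0.95], [0.2, 0.85, 0.0]]

# Recognition is a nearest-neighbor lookup per crop; no language model is involved.
text = "".join(recognize(e, REFERENCE_INDEX) for e in line_crops)
print(text)
```

In this toy setup, supporting an additional character amounts to adding one more entry to the index, which gives some intuition for why a retrieval formulation may be easier to extend than a jointly trained vision-language model.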

Time: 12:00 - 1:00 PM ET | 9:00 - 10:00 AM PT | 6:00 - 7:00 PM CET.


Professor Dell is the Andrew E. Furer Professor of Economics at Harvard University. She is the 2020 recipient of the John Bates Clark Medal, awarded each year to an American economist under the age of forty who is judged to have made the most significant contribution to economic thought and knowledge. In 2018, The Economist named her one of the decade’s eight best young economists, and in 2014 she was named by the IMF as the youngest of 25 economists under the age of 45 shaping thought about the global economy. Her research focuses on economic growth and political economy. She has examined the factors leading to the persistence of poverty and prosperity in the long run, the effects of trade-induced job loss on crime, the impacts of U.S. foreign intervention, and the effects of weather on economic growth. She has also developed deep-learning-powered methods for curating social science data at scale, released in the open-source package Layout Parser. This work supports many of her current projects, which rely on digitizing historical sources far too large for manual digitization. Professor Dell is a senior scholar at the Harvard Academy for Area and International Studies and a research associate at the National Bureau of Economic Research. She received an AB in Economics from Harvard in 2005, an MPhil in Economics from Oxford in 2007, and a PhD in Economics from MIT in 2012. Before joining the Harvard Economics department in 2014, she was a Junior Fellow at the Harvard Society of Fellows.
