I was recently talking with a university librarian about ways that he could digitize some of his collections. In our conversation, I brought up two “crowdsourcing” projects I knew of where librarians and archaeologists were soliciting help from the public in order to digitize large bodies of text documents from both a library collection and large archaeological projects.
The output of digitization projects, machine-readable documents, makes fast searching and indexing of archaeological information possible and allows researchers to conduct text-mining analyses that extract patterns from the digitized information. Actually digitizing older documents, especially handwritten ones, is tedious work, however, and many organizations do not have the resources to complete it. While archaeologists are collecting more and more data digitally, a vast body of archaeological information exists only on paper: books, reports, articles, forms, and field notebooks. Even if we assume that anything published may have been digitized by the publisher (if the publisher still exists), that still leaves a large body of archaeological data and analyses trapped on paper.
One way archaeologists and others can digitize paper documents is to use optical character recognition (OCR) software, which analyzes scanned pages and matches image shapes to text characters and words. While OCR programs have improved a lot over the years, they still make errors, especially when converting text that uses specialized language, and they are generally pretty bad at recognizing handwritten characters. Often, humans need to carefully proofread and double-check OCR transcriptions to ensure the text matches the original documents. Several organizations are pursuing crowdsourcing projects that invite volunteers to help verify digitized documents or even transcribe documents too complex for OCR software to digitize.
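To make the proofreading step a little more concrete, here is a minimal Python sketch of how a project might flag the places where an OCR transcription disagrees with a volunteer's transcription of the same line. It uses only the standard library's `difflib` module. The function name `flag_differences` and the sample sentence are my own invention for illustration, not taken from any of the projects mentioned here, and a real workflow would need to handle punctuation, line breaks, and multiple volunteers.

```python
import difflib


def flag_differences(ocr_text, human_text):
    """Return (ocr_words, human_words) pairs wherever the two transcriptions disagree."""
    ocr_words = ocr_text.split()
    human_words = human_text.split()
    # Compare the two transcriptions word by word.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=human_words, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the differing stretch from each version for a human to review.
            flagged.append((" ".join(ocr_words[i1:i2]), " ".join(human_words[j1:j2])))
    return flagged


# Made-up example line: the OCR software misreads "flakes" as "f1akes".
ocr_line = "Unit 3 contained two chert f1akes and a sherd"
human_line = "Unit 3 contained two chert flakes and a sherd"

for ocr_part, human_part in flag_differences(ocr_line, human_line):
    print(f"OCR read {ocr_part!r}; transcriber read {human_part!r}")
```

A comparison like this does not say which version is correct. It only narrows down where a proofreader should look, which is useful when the alternative is rereading every page in full.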