I was recently talking with a university librarian about ways that he could digitize some of his collections. In our conversation, I brought up two “crowdsourcing” projects I knew of where librarians and archaeologists were soliciting help from the public in order to digitize large bodies of text documents from both a library collection and large archaeological projects.
The output of digitization projects, machine-readable documents, makes fast searching and indexing of archaeological information possible and allows researchers to conduct text-mining analyses that extract patterns from the digitized information. Actually digitizing past documents, especially handwritten ones, is tedious work, however, and many organizations do not have the resources to complete it. While archaeologists are collecting more and more data digitally, a vast body of archaeological information exists only on paper: books, reports, articles, forms, and field notebooks. Even if we assume that anything published may have been digitized by the publisher (if it still exists), that still leaves a large body of archaeological data and analyses trapped on paper. One way archaeologists and others can digitize paper documents is to use optical character recognition (OCR) software, which analyzes scanned pages and matches image shapes to text characters and words. While OCR programs have improved a lot over the years, they still make errors, especially when converting text that uses specialized language, and they are generally pretty bad at recognizing handwritten characters. Often, humans need to carefully proofread and double-check OCR transcriptions to ensure the text matches the original documents. Several organizations are pursuing crowdsourcing projects that invite volunteers to help verify digitized documents or even transcribe documents too complex for OCR software to digitize.
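To make the proofreading step a little more concrete, here is a minimal Python sketch of how one might flag the words where an OCR transcription disagrees with a human-checked version. It is only an illustration under my own assumptions: the function name `flag_mismatches` and the sample sentence are invented for this example, and neither project described below is known to work this way. It uses `difflib` from Python's standard library.

```python
import difflib


def flag_mismatches(ocr_text, proofread_text):
    """Return (ocr_words, proofread_words) pairs where the two versions differ.

    Hypothetical helper for illustration; not taken from any project's code.
    """
    ocr_words = ocr_text.split()
    proof_words = proofread_text.split()
    # Align the two word sequences so that differing runs can be identified.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=proof_words, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append(
                (" ".join(ocr_words[i1:i2]), " ".join(proof_words[j1:j2]))
            )
    return mismatches


# Invented sample text with two typical OCR confusions ("h" read as "b" or "li").
ocr = "Tbe trench was dug near tlie ziggurat"
proof = "The trench was dug near the ziggurat"
mismatches = flag_mismatches(ocr, proof)
print(mismatches)  # [('Tbe', 'The'), ('tlie', 'the')]
```

A list like this could give a proofreader a short set of words to check rather than a whole page, although a real workflow would still need a person to decide which reading is correct.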
I recommended two projects to the librarian. The first is a crowdsourcing effort by the British Museum and the University of Pennsylvania Museum to digitize documents from their 1922-1934 excavations at the Mesopotamian city of Ur with the Iraqi Department of Antiquities. The project, known as “UrCrowdsource,” uses an Omeka content management system (CMS) and website platform to index JPEG scans of fieldwork documents such as letters, ledgers, and field notes. UrCrowdsource’s website uses a plugin called Scripto that allows volunteers to pick document images and record their transcriptions. While the aesthetics of the UrCrowdsource webpage are basic, the website includes several essential features, such as a login system to track and give credit to the transcriptionists, basic and advanced search functions, and a “Terminology” page that gives users a glossary of names of fieldworkers, annotation guidelines, a coding sheet of abbreviations, and, most critically, a list of jargon words that may be familiar to a Mesopotamian archaeologist but unintelligible to a volunteer sitting at a computer in the Midwest or Hong Kong.
The second project is the “DigiTalkoot” (Finnish for “Digital Volunteer”) project by the National Library of Finland, which combined crowdsourcing with games to digitize documents from their newspaper archives. Aiming to make an “Angry Birds for Thinking People,” the National Library partnered with another organization, MicroTask, to create casual games that gave volunteers the opportunity to assist with verifying and digitizing historic newspaper articles. MicroTask and the Library created two games that gave users small pieces of text, usually one or two words, to assess how accurately a computer OCR program had digitized the text. In the first game, Mole Hunt (Myyräjahti), moles pop out of the on-screen field holding signs with scanned images of words along with an OCR transcription. To make a mole disappear, players have to decide whether the OCR transcription of a word matches the scanned image of that word. At the end of each round, the game scores players and awards points for correct answers. In the second game, Mole Bridge (Myyräsilta), players type out the transcription of a scanned image of a word in order to “build” a bridge that the moles can use to cross a river. Through these games, players verify OCR transcriptions and also transcribe words too difficult for OCR to recognize. The DigiTalkoot website lists the top ten volunteers along with each volunteer’s number of completed tasks and hours spent playing. The top volunteer, Petri M., completed 348,422 tasks and spent 395 hours playing the two games! Impressively, by making digitization an engaging and fun task, DigiTalkoot convinced nearly 110,000 participants to complete over 8 million word-fixing tasks by the end of the project in November 2012. The Library has integrated the fixed words from the project into the Historical Newspaper Library of The National Library of Finland.
The Library feels the project was so successful that they are continuing their crowdsourcing efforts with a new project to annotate newspaper articles, “Kuvatalkoot,” in late 2013.
Each of these crowdsourcing projects offers different benefits to its volunteers in return for their services. UrCrowdsource gives volunteers a view into fieldwork practice by making public documents that most people would never otherwise see. In the case of the Ur documents, I found myself quite enjoying reading through the narratives in the field notes while transcribing. The value for the volunteers is much clearer in the DigiTalkoot project. In addition to having fun “whacking” or saving moles, players can also feel that, through their play, they are helping accomplish a larger goal that benefits Finland and enhances Finnish cultural heritage.
I hope more archaeologists and archivists will consider using innovative crowdsourcing projects like these to involve the public and help digitize documents. If you know of any other digitization projects in archaeology, please post them in the comments below!
DigiTalkoot Project: http://www.digitalkoot.fi/
National Library of Finland Turns to Crowdsourcing (Games)
Omeka content management system (CMS): http://omeka.org/
UrCrowdsource Digitization Project