29 Jun 2011

Archiving Every Language Ever Spoken: The Rosetta Project


The Rosetta Project, a programme of The Long Now Foundation, sets out to document every human language currently in use and so create a contemporary Rosetta Stone. The aim of the archival venture is to build a publicly accessible digital library of human languages in collaboration with language specialists and native speakers.

The Rosetta list totals some 7,000 languages. The project hopes to keep all of them from disappearing undocumented by preserving them on the Rosetta Disk, which currently holds 13,000 microetched pages of word lists from 1,500 languages. The Rosetta website describes the project as The Long Now Foundation’s ‘first exploration into very long-term archiving. It serves as a means to focus attention on the problem of digital obsolescence, and ways we might address that problem through creative archival storage methods.’

The three-inch Disk features nearly 14,000 pages of microscopically etched information, readable by the human eye under optical magnification. The Disk functions as a keystone for the parallel archive that the project stores as part of the Internet Archive.

The project organisers will hold a ‘Record-a-thon’ on 30 July in San Francisco, where they hope to capture on video speakers of more than 100 languages from the Bay Area, recording them as they tell stories and hold conversations.

Spoken languages will ideally be recorded on video, which, as director Laura Welcher explains, offers a ‘richer source of language information--it also documents context (to help with what the speaker is talking about, or pointing to), speech participants (like a conversation, or public speaking event), as well as the speaker's body and facial gestures.’

Interviewed by email for FastCompany.com, Welcher explained the intended use of this kind of archived data for researchers: ‘A corpus can be used in many different ways--a small corpus can provide language learning and teaching materials, as well as materials for the building of linguistic resources such as grammars and dictionaries (this is the kind of language documentation linguists are producing today). Then, with a larger corpus--say tens of hours of transcribed speech, we can start building acoustic models for speech recognition. With a few million words we can start to do machine translation. And these are the tools that enable a language to be used online--which I would argue is a crucial new domain for language use in the modern world.’

A full transcript of the interview, along with the article by Matthew Battles, is available on FastCompany.com.