Sáliba Case Study:
From Cassette to the Web
- Audio digitization
- Image digitization
- Text digitization
- Metadata creation
- Web presentation
- Follow the path of the Sáliba data
During field work in 1996, SIL International linguist Nancy Morse collected a standardized wordlist of 375 items, representing the variety of Sáliba spoken near the Casanare River in Colombia. The words were filled in by hand on a 15-page form in Americanist phonetic transcription, with glosses in Spanish and English. Morse recorded 40 minutes of two Sáliba men on a portable cassette recorder, with Saúl Humejé reading the Spanish prompts, and Angel Eduardo Humejé speaking the Sáliba words.
Dr. Gary Simons and Dr. Paul Frank of SIL International have digitized these materials. Their first consideration was the question of portability of and access to the digital language materials. This includes the synchronic portability of resources across a multiplicity of present-day computing platforms as well as the diachronic portability of today's resources to computing platforms of the future. Bird and Simons (2003) identify seven dimensions of portability and propose best practice guidelines designed to maximize the ability of digital language resources to remain usable far into the future.
Thus, Dr. Simons and Dr. Frank digitized the Sáliba materials with the following goals:
- To make archival-quality digital versions of the audio recording and the wordlists for long-term preservation.
- To make the Sáliba data available in an easily accessible version on the Internet.
- To create and publish standardized metadata for the digital files.
Dr. Simons and Dr. Frank began by making an archival-quality digital version of the audio recording. Lack of hardware prevented them from digitizing the audio recordings at the recommended best practice sample rate of 96,000 Hz and bit depth of 24. Instead, the recordings were digitized at a sample rate of 44,100 Hz and a bit depth of 16. While the higher settings are recommended for archival purposes, they may be difficult to achieve without specialized equipment. Dr. Simons and Dr. Frank stored the file in the recommended best practice WAV file format. To make the digital audio available for web browsing, they created a small WAV file of each of the Sáliba responses. Because WAV files are uncompressed, they are usually large and take a long time to download, so MP3 files are more commonly used for web presentation. However, the small size of the Sáliba sound clips made the WAV format suitable for web access in this case.
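The trade-off between the two digitization settings, and the idea of cutting one small WAV clip per response, can be illustrated with a minimal Python sketch. It is not the procedure Dr. Simons and Dr. Frank used. The function names are hypothetical, a mono recording is assumed (the source does not state the channel count), and a generated silent file stands in for the real recording.

```python
import io
import wave


def pcm_size_bytes(seconds, sample_rate, bit_depth, channels=1):
    """Bytes of raw PCM audio, ignoring the small WAV header."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)


def make_silent_wav(seconds, sample_rate=44100, sample_width=2):
    """Build an in-memory mono WAV of silence (a stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(bytes(int(seconds * sample_rate) * sample_width))
    buf.seek(0)
    return buf


def extract_clip(source, start_s, stop_s):
    """Copy the frames between start_s and stop_s into a new in-memory WAV."""
    with wave.open(source, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(start_s * rate)
        count = int(stop_s * rate) - first
        src.setpos(first)
        frames = src.readframes(count)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)      # same rate, width, and channels as the source
        dst.writeframes(frames)    # the frame count in the header is patched here
    out.seek(0)
    return out


if __name__ == "__main__":
    duration = 40 * 60  # the 40-minute recording, in seconds
    print("44,100 Hz / 16-bit:", pcm_size_bytes(duration, 44100, 16), "bytes")
    print("96,000 Hz / 24-bit:", pcm_size_bytes(duration, 96000, 24), "bytes")

    # Made-up start and stop times, in seconds, for a single utterance.
    clip = extract_clip(make_silent_wav(2.0), 0.5, 1.25)
    with wave.open(clip, "rb") as w:
        print("clip length:", w.getnframes() / w.getframerate(), "seconds")
```

Under these assumptions the higher settings roughly triple the storage needed for the full recording, while a clip of a second or so remains small enough to serve as uncompressed WAV.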
Dr. Simons and Dr. Frank produced two sets of image files of the handwritten wordlists: a master, or archival, set and an access, or presentation, set. The master images were scanned at the recommended resolution of 300 dpi in 8-bit grayscale. The original size of the documents (8.5" x 11") was preserved, and the images were saved in uncompressed TIFF format. The access images were scanned at 75% of the original size, at 72 dpi in 8-bit grayscale. These images were saved in an 8-bit interlaced GIF format for presentation on the web.
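To give a sense of what these scan settings mean in pixels and bytes, here is a short Python calculation. It assumes that "75% of the original size" means a 75% linear scale of each page dimension, which the source does not spell out. The function names are hypothetical.

```python
def scan_pixels(width_in, height_in, dpi, scale=1.0):
    """Pixel dimensions of a page scanned at `dpi`, optionally scaled down."""
    return round(width_in * scale * dpi), round(height_in * scale * dpi)


def uncompressed_bytes(width_px, height_px, bits_per_pixel=8):
    """Raw pixel data size; a real TIFF file adds a small header on top."""
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    master = scan_pixels(8.5, 11, 300)             # archival scan, full size
    access = scan_pixels(8.5, 11, 72, scale=0.75)  # web scan, assumed linear scale
    print("master:", master, uncompressed_bytes(*master), "bytes per page")
    print("access:", access, uncompressed_bytes(*access), "bytes before GIF compression")
```

A master page comes out at 2550 x 3300 pixels, or about 8.4 MB of uncompressed 8-bit grayscale data, which is why a much smaller access set is kept for the web.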
Dr. Simons and Dr. Frank created an archival-quality digital encoding of the transcribed wordlist as an XML file with descriptive markup tags. XML is an open standard of the World Wide Web Consortium that is based on extensible tags (extensible meaning that they are not predefined, but can be defined by the creator). XML is currently considered best practice for the archival encoding of textual data, because it does not depend upon any particular software. Furthermore, it is generally more self-descriptive than other electronic formats, which should make it more accessible to future generations. The Sáliba wordlist file captures the Spanish and English prompts, the phonetic transcription in IPA (using Unicode for the encoding), the additional notes, and the start and stop times in the digital audio files for each Sáliba utterance.
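The following Python sketch shows what descriptive markup for one wordlist entry might look like and how any XML-aware tool can read it back. The element and attribute names are invented for illustration and are not the schema of the actual Sáliba file. The prompt, transcription, note, and timing values are placeholders rather than real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names, attribute names, and all values are
# illustrative placeholders, not taken from the real Sáliba wordlist file.
SAMPLE = """\
<wordlist language="Sáliba">
  <entry id="001">
    <prompt lang="es">agua</prompt>
    <prompt lang="en">water</prompt>
    <transcription encoding="IPA">PLACEHOLDER</transcription>
    <note>placeholder note</note>
    <audio start="12.40" stop="13.15"/>
  </entry>
</wordlist>
"""


def read_entries(xml_text):
    """Return one dict per <entry>, with prompts, transcription, and timings."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        prompts = {p.get("lang"): p.text for p in entry.findall("prompt")}
        audio = entry.find("audio")
        entries.append({
            "id": entry.get("id"),
            "spanish": prompts.get("es"),
            "english": prompts.get("en"),
            "transcription": entry.findtext("transcription"),
            "note": entry.findtext("note"),
            "start": float(audio.get("start")),
            "stop": float(audio.get("stop")),
        })
    return entries


if __name__ == "__main__":
    for item in read_entries(SAMPLE):
        print(item)
```

Because the tags name what each piece of data is, the file can be read without the software that produced it, which is the property that makes XML attractive for archiving.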
The next step was to create a metadata description of this set of archival materials. Metadata is information about resources. It is similar to card catalog information in a library: it enables discovery and retrieval of resources. In this case, the resource description also states that the materials are copyrighted and available to all under standard terms of Fair Use. Although there are a number of metadata standards currently in use, only two were developed specifically to describe language resources: the OLAC metadata standard and the IMDI metadata standard. Dr. Simons and Dr. Frank chose the simpler OLAC metadata standard. It is stored in XML format, which is easily transformed into other formats through the use of XSL stylesheets.
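As a rough illustration of a catalog-style description stored as XML, the Python sketch below builds a small record from plain Dublin Core elements, the element set that OLAC metadata extends. It is a simplified stand-in and not a valid OLAC record. The wrapper element name `record` and the title are hypothetical, and the other values are taken from the narrative above.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)

# Illustrative values only; the title is hypothetical.
FIELDS = [
    ("title", "Sáliba wordlist"),
    ("creator", "Morse, Nancy"),
    ("date", "1996"),
    ("language", "Sáliba"),
    ("rights", "Copyrighted; available under standard terms of Fair Use"),
]


def build_record(fields):
    """Wrap (name, value) pairs in Dublin Core elements under a plain <record>."""
    record = ET.Element("record")  # hypothetical wrapper, not the OLAC root element
    for name, value in fields:
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


if __name__ == "__main__":
    print(ET.tostring(build_record(FIELDS), encoding="unicode"))
```

A real OLAC record adds its own namespace, controlled vocabularies, and a schema to validate against, so the OLAC documentation should be consulted before publishing metadata.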
Finally, the Sáliba data was presented as a web page with integrated images, recordings, and metadata. An XSL script was developed to generate an HTML presentation form for viewing the digitally encoded wordlist, providing links to display the digital images of the original transcriptions and to play back the recording of each utterance. XSL stylesheets are used to transform XML documents into different file formats (for instance, HTML, text, or PDF), without changing the original XML document. This demonstrates the portability afforded by best practices. The presentation form was published on the SIL International website to enable linguists to inspect the data over the Internet. Access is ensured by the publication of a presentation form of all language materials both on the web and on CD-ROM. If you would like to read a thorough description of the digitization process of the Sáliba data, please see Frank and Simons 2003.
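The idea of deriving a presentation form from the archival XML without altering it can be sketched in Python. This uses Python's standard library in place of XSL, so it illustrates the principle rather than the stylesheet that was actually developed. The markup, the file naming scheme for audio clips and page images, and all values are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical archival markup with placeholder values.
SAMPLE = """\
<wordlist>
  <entry id="001" page="1">
    <spanish>agua</spanish>
    <english>water</english>
    <transcription>PLACEHOLDER</transcription>
  </entry>
</wordlist>
"""


def render_html(xml_text):
    """Build an HTML table with one row per entry, linking to audio and image."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        entry_id = html.escape(entry.get("id"))
        page = html.escape(entry.get("page"))
        cells = "".join(
            f"<td>{html.escape(entry.findtext(tag, default=''))}</td>"
            for tag in ("spanish", "english", "transcription")
        )
        # Assumed naming scheme: one WAV clip per entry, one GIF per form page.
        links = (
            f'<td><a href="audio/{entry_id}.wav">play</a></td>'
            f'<td><a href="images/page{page}.gif">view</a></td>'
        )
        rows.append(f"<tr>{cells}{links}</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    print(render_html(SAMPLE))
```

The archival XML is only read, never modified, so the same source could feed other presentation forms, such as plain text, in the same way.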
- Get Started: Summary of the Sáliba conversion
- Digitize Audio: Digitizing Audio page (Classroom)
- Digitize Images: Digitizing Images page (Classroom)
- Digitize Text: XML page (Classroom)
- Create Metadata: Metadata page (Classroom)
- Web Presentation: Stylesheets page (Classroom)
E-MELD School of Best Practices: From Cassette to the Web: Sáliba