This section is designed for linguists who have data in one format, physical or digital, and wish to convert it to another format. Conversion is a complex topic, and can be divided into at least three categories:
- Converting characters from one character set to another
- Converting linguistic data from one program (e.g. Shoebox) into another (e.g. FIELD)
- Converting audio or video data into another audio or video format
This section will deal with each of these in turn. For all types of conversion, however, the following principles are helpful:
- Try to find someone who has experience in the kind of conversion you wish to do. If you can't, do as much research as you can on that kind of conversion before you begin.
- The more neutral you make the format, the more likely it is that your data will survive for the long term.
- When you convert, be consistent. Variation is hard for others to interpret.
- If there's additional information pertaining to the language resource you're converting, make sure you preserve it along with the resource.
- The best available copy of your converted data should be archived according to best practices.
In the current digital climate, only one kind of character conversion makes sense: conversion into the Unicode standard. Unicode already contains almost all the characters linguists need, and later versions are expected to fill the remaining gaps. Thus there is a simple rule linguists should follow wherever possible: convert your textual data into Unicode. Other encodings, such as ASCII, are acceptable when they cover every character in your data, but they support far fewer languages than Unicode.
Character conversion is difficult for linguists, however. Numerous utilities exist for converting one character set to another. The Unix utility iconv converts between ISO 8859-X and Unicode, and the C3 system, developed by the Trans-European Research and Education Networking Association (TERENA), converts between European character sets. But such facilities are rarely of use to linguists, since they are designed to convert one standard character set to another. They will thus convert Cyrillic into Latin, for example, or ISO 8859-X into Unicode. Linguists, by contrast, have in the past represented IPA either with the non-Unicode encodings defined by fonts such as IPAKiel or the SIL font suite, or with the (X)SAMPA alphabet in ASCII. Many have simply used arbitrary characters of their own choosing. Such data is hard to convert into Unicode precisely because the choices are arbitrary: no standard tool knows which character was meant.
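Of the legacy representations mentioned above, (X)SAMPA is the most regular, so it can be converted with a lookup table. The following is only a minimal sketch: the table covers a handful of X-SAMPA symbols, the name `xsampa_to_ipa` is an illustrative assumption, and a usable converter would need the full inventory. Longer symbols are tried first so that a two-character symbol such as `r\` is not split.

```python
# Partial X-SAMPA -> IPA table. Only a few symbols are listed here;
# a real converter needs the complete X-SAMPA inventory.
XSAMPA_TO_IPA = {
    "r\\": "ɹ",  # two-character symbol: r followed by a backslash
    "_h": "ʰ",   # aspiration diacritic
    "S": "ʃ",
    "Z": "ʒ",
    "T": "θ",
    "D": "ð",
    "N": "ŋ",
    "I": "ɪ",
    "E": "ɛ",
    "{": "æ",
    "@": "ə",
    "?": "ʔ",
    ":": "ː",
    '"': "ˈ",
}

# Try longer symbols first so that "r\" wins over a bare "r".
_KEYS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text: str) -> str:
    """Convert an X-SAMPA string to Unicode IPA using the partial table."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # Characters not in the table (p, t, a, ...) pass through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('"fIS'))
```

Characters that are absent from the table are copied through unchanged, so any gaps in the table show up as leftover ASCII in the output rather than as silent errors.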
The best comprehensive character conversion facility existing so far is TECkit, produced by SIL. However, this is a complex piece of software, and it requires some skill to use. It can be extended with new mappings, but this is not easy to do. If you're interested in trying this yourself, there is a useful tutorial on the SIL site. For the ordinary user, however, it is probably easier just to use a Unicode-aware word processor such as Word 2000 or XP, and globally replace characters by hand. The E-MELD project is currently developing a utility which will allow you to do simple mappings from one character set to Unicode, but as yet it is not ready for general use.
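A simple mapping of this kind can also be scripted. The sketch below is not TECkit and not the E-MELD utility; it only illustrates the idea of replacing each legacy font code with a Unicode character. The byte values in `LEGACY_TO_UNICODE` are invented for illustration and do not describe IPAKiel or any SIL font; you would have to build the real table from your own font's documentation.

```python
# HYPOTHETICAL mapping from byte values in a legacy 8-bit font encoding
# to Unicode IPA characters. These codes are made up for this example.
LEGACY_TO_UNICODE = {
    0x83: "ʃ",
    0x8D: "ŋ",
    0xAB: "ə",
}


def legacy_to_unicode(data: bytes):
    """Return (converted_text, unmapped_codes) for legacy-encoded bytes."""
    # Decoding as latin-1 maps every byte 0-255 to the code point with the
    # same number, so the original byte values stay visible as characters.
    raw = data.decode("latin-1")
    converted = raw.translate(LEGACY_TO_UNICODE)
    # Report high bytes that have no mapping, so nothing is converted
    # silently to the wrong character.
    unmapped = sorted(
        {ord(ch) for ch in raw if ord(ch) >= 0x80 and ord(ch) not in LEGACY_TO_UNICODE}
    )
    return converted, unmapped


if __name__ == "__main__":
    text, missing = legacy_to_unicode(b"\x83ip")
    print(text, missing)
```

Returning the list of unmapped codes is a deliberate choice: it makes an incomplete table visible, which matters when the original encoding was chosen ad hoc.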
If you've been storing your data in one program -- FileMakerPro, for example, or even Word -- and wish to move your data to another -- perhaps more useful -- program, there is as yet no straightforward way to do it. Which conversions you need will depend on the program you are coming from and the program you are going to. The E-MELD project is currently developing utilities which will take data files from programs which linguists commonly use -- in particular Shoebox, Excel and FileMakerPro -- and convert them to a standard XML format, which can be read into XML-aware programs such as E-MELD's FIELD lexical analysis tool-set. But as yet these are not ready for general use.
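To give a sense of what such a conversion involves, here is a rough sketch that reads Shoebox-style backslash-coded fields (such as `\lx`, `\ps` and `\ge`) and writes them out as XML. It is not the E-MELD converter: the element names `lexicon`, `entry` and `field` are assumptions made for this example, not the E-MELD or FIELD schema, and the sample entry is purely illustrative.

```python
import xml.etree.ElementTree as ET


def shoebox_to_xml(text: str, record_marker: str = "lx") -> ET.Element:
    """Convert backslash-coded records into a simple XML tree.

    The element names used here are illustrative assumptions only.
    """
    root = ET.Element("lexicon")
    entry = None
    for line in text.splitlines():
        line = line.strip()
        # This sketch ignores blank lines and continuation lines.
        if not line.startswith("\\"):
            continue
        marker, _, value = line[1:].partition(" ")
        # Skip program header lines such as "\_sh ...".
        if marker.startswith("_"):
            continue
        # A record marker (by default \lx) starts a new entry.
        if marker == record_marker or entry is None:
            entry = ET.SubElement(root, "entry")
        field = ET.SubElement(entry, "field", marker=marker)
        field.text = value.strip()
    return root


if __name__ == "__main__":
    sample = "\\lx kitab\n\\ps n\n\\ge book\n"
    print(ET.tostring(shoebox_to_xml(sample), encoding="unicode"))
```

Keeping the original marker as an attribute, rather than guessing a meaning for each field, preserves the information in the source file and leaves interpretation to a later step.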
To preserve the integrity of your audio and video data, the best rule of thumb is: don't convert. This is obviously seldom practical, since magnetic media deteriorate over time, and the equipment needed to play them often becomes obsolete even sooner. It is therefore recommended that you always preserve the data in its original form and maintain a conversion history for all data. This way, any loss of information during conversion is fully documented and can be traced back later. When you have to convert, do the absolute minimum number of format conversions you need, since some degree of information loss is almost inevitable in conversion from analog to digital formats.
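A conversion history does not need special software. As one possible approach, the sketch below appends a record of each conversion to a JSON file, together with checksums of the source and output files so that later copies can be verified. The file layout and the names `record_conversion` and `sha256_of` are assumptions for illustration, not a prescribed format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_conversion(history_path, source, output, tool, note=""):
    """Append one conversion step to a JSON history file and return it."""
    history_path = Path(history_path)
    if history_path.exists():
        history = json.loads(history_path.read_text(encoding="utf-8"))
    else:
        history = []
    step = {
        "date": datetime.now(timezone.utc).isoformat(),
        "source": str(source),
        "source_sha256": sha256_of(source),
        "output": str(output),
        "output_sha256": sha256_of(output),
        "tool": tool,   # software or hardware used for this step
        "note": note,   # e.g. known losses, settings, operator
    }
    history.append(step)
    history_path.write_text(json.dumps(history, indent=2), encoding="utf-8")
    return step
```

Each call adds one step, so a recording that has been converted several times carries the full chain from the original to the current copy.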
To make a digital copy of analog data:
- Connect your analog player (e.g. cassette player) to a digital recording device, using the appropriate cable. If you wish to create an archival quality copy, choose the digital recording device according to our hardware recommendations.
- If you do not have access to these recommended devices, but need to create a working digital format, you can record onto your computer. However, sound cards that come standard on computers may not record archival quality data.
- Play the analog recorder, capturing the output into your digital recording device.
- With some devices (e.g. a computer), you may need to use audio or video recording software.
- Follow the recommendations on our audio pages to ensure the highest quality audio recording possible.
- Follow the recommendations on our video pages to ensure the highest quality video recording possible.
- Audio and video recording software can be found in our software database.
Digital-to-digital conversion can be lossless when done properly, that is, when the target format is uncompressed or uses lossless compression; converting into a lossy compressed format discards information. In order to convert from one digital format to another:
- Determine which format your audio or video data is already in, and choose which format you want to convert it into.
- Find conversion software that provides lossless conversion.
- Search for audio and video conversion software in our software database.
- Follow the instructions in the software documentation to convert your data.
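One way to check that an audio conversion really was lossless is to convert the result back to the original format and compare the audio samples. The sketch below assumes both files are uncompressed PCM WAV files that Python's standard `wave` module can read; the function names are illustrative, and the conversion itself is left to whatever software you have chosen.

```python
import wave


def read_pcm(path):
    """Return (parameters, raw sample bytes) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        params = (
            wav.getnchannels(),   # number of channels
            wav.getsampwidth(),   # bytes per sample
            wav.getframerate(),   # sampling rate in Hz
            wav.getnframes(),     # number of frames
        )
        frames = wav.readframes(wav.getnframes())
    return params, frames


def is_lossless_copy(original, round_trip) -> bool:
    """True if both WAV files hold identical audio data and parameters."""
    return read_pcm(original) == read_pcm(round_trip)
```

Comparing decoded samples rather than whole files is intentional: two WAV files can differ in header metadata and still contain exactly the same audio.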
Digital conversion from recent audio and video formats is relatively simple. However, professional expertise is needed for conversion of materials on older media, such as audio wire recordings or wax cylinders, or nitrate or cellulose acetate film. If you have this sort of media, look for help from librarians or archivists at your institution, to avoid ruining irreplaceable recordings.
Some good sources on conversion include the following:
- Library Digitization (Library Preservation at Harvard)
- Technical Recommendations for Digital Imaging Projects (Image Quality Working Group of ArchivesCom)
- Is Digital Conversion Really Part of Preservation?
- Creating Digital Content: Digitization (pdf)
- Chilin Shih's E-MELD presentation on Sound Conversion (Powerpoint)
- LSA Presentation on Resource Conversion (Powerpoint)