- Ambiguous annotation
- How to encode characters unambiguously
- How to encode annotation unambiguously
- Choosing software
Annotation is information about linguistic data. It typically represents information which can be called "grammatical" -- e.g. morphological, syntactic, semantic, discourse and pragmatic information -- as perceived through the eyes of the linguist who collected the data.
To ensure the long-term intelligibility of this data, characters must be encoded unambiguously. This is done by using an encoding system that maintains a one-to-one correspondence between code-points and the characters they represent. The annotative terminology must also be transparent, which is achieved by linking the terms used to a generally accepted ontology. This page addresses these two concerns of clarity.
In a print document there can be no difference between how a text looks and how it is represented. In a digital document this distinction is crucial, and it is one of the major reasons for lack of intelligibility in digital material. The clearest example of this is found with fonts. Older, 8-bit fonts provide a maximum of 256 unique code-points for representing characters. Since there are considerably more unique Latin characters than this, different fonts use the same code to represent different characters. This means that a change in the font used to represent a text could make the material incomprehensible.
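The ambiguity of legacy 8-bit encodings is easy to demonstrate: the very same byte value decodes to entirely different characters depending on which character set (the digital equivalent of the font) is assumed. A small illustration in Python:

```python
# One byte, three legacy 8-bit character sets, three different characters.
raw = bytes([0xE9])

print(raw.decode("latin-1"))     # 'é' -- Latin small e with acute
print(raw.decode("cp1251"))      # 'й' -- Cyrillic small i with breve
print(raw.decode("iso-8859-7"))  # 'ι' -- Greek small iota
```

Without a record of which encoding was intended, the byte 0xE9 is simply ambiguous; this is precisely the intelligibility problem that a universal encoding is meant to solve.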
Annotation can also be ambiguous, but in other ways. For example, the term "nominative" is used in some Australian material to indicate the case taken by the subject of an intransitive clause, whereas in Indo-European linguistics it indicates the case taken by the subject of any clause. The same problem recurs across traditions, i.e., a single term may be used to describe quite different language structures (as when "absolute" indicates a nonpossessed form in Semitic, but a transitive object/intransitive subject in ergative languages).
In order for character representations to be unambiguous, they must use a universal encoding system that assigns a unique code-point to each character it represents. Such a system exists in Unicode. Unicode is a world-wide standard in which each character is given a 21-bit value. The Unicode ISO standard (ISO-10646-1) allows for 2,147,483,648 unique code points, although only 96,248 are assigned in Unicode 4.0. This encoding strategy provides a unique code-point for (ultimately) every character ever used in the world's languages. Thus Unicode is unambiguous, and it is the encoding standard universally recommended for material for which long-term intelligibility is critical.
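The one-to-one correspondence between characters and code-points can be seen directly: every character has exactly one code point, independent of font or byte-level serialization. A brief sketch in Python (the word used for round-tripping is just an arbitrary example):

```python
# Each character maps to exactly one Unicode code point.
for ch in "a\u00e9\u3042\U00010437":   # 'a', 'é', hiragana 'あ', Deseret '𐐷'
    print(f"U+{ord(ch):04X}  {ch}")

# Different encoding forms (UTF-8, UTF-16) serialize the same code
# points differently, but all round-trip to identical characters.
text = "\u014barra"   # arbitrary example word containing 'ŋ'
assert text.encode("utf-8").decode("utf-8") == text
assert text.encode("utf-16").decode("utf-16") == text
```

The encoding form chosen (UTF-8, UTF-16, UTF-32) affects only the byte layout on disk; the code points, and hence the characters, remain unambiguous.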
For annotation to be digitally unambiguous, we need to approach the problem in a different way from that of fonts. It is unreasonable to expect all linguists to use exactly the same term for the same linguistic phenomenon in their work. There are two reasons for this. First, linguists should be able to use the terms that make most sense to them. And, second, different traditions of linguistic study use different terms. It would be confusing to change these.
Although the use of a single terminology set is an unattainable and probably undesirable goal, the lack of uniformity presents serious difficulties for both humans and machines. Current and future linguists may find the documentation difficult to interpret, since they must first learn new terminology before they can understand a new body of data; and if two bodies of data have incompatible markup, the datasets will be difficult to compare.
Moreover, a multiplicity of terminology sets impedes computational retrieval and analysis of the data. It inhibits machine searching for similar linguistic structures, since no search engine can be expected to "know" that differently named entities are equivalent. And it hinders the development of general tools for language analysis, since each dataset requires different software. Machine-readable documentation defining and correlating the items in different terminology sets would alleviate these problems. But rarely, if ever, is endangered language (EL) documentation accompanied by such exposition.
The solution is to let linguists use any term-set they wish, but to link those terms systematically to a commonly accepted, well-documented term set: an ontology of linguistic terms, developed by linguists but interpretable by machines. Such an ontology is under development: GOLD (the General Ontology for Linguistic Description), a computational statement of the relationships between possible linguistic features.
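The linking idea can be sketched as a simple lookup from project-specific labels to shared concepts. The concept names below are illustrative assumptions for the sake of the example, not GOLD's actual identifiers:

```python
# Sketch: mapping project-specific annotation labels to shared
# ontology concepts. Concept names here are hypothetical stand-ins,
# not the real GOLD identifiers.
TERM_MAP = {
    # local gloss           -> shared concept (assumed names)
    "NOM (Australianist)":  "CaseOfIntransitiveSubject",
    "NOM (Indo-European)":  "NominativeCase",
    "ABS (Semitic)":        "NonPossessedForm",
    "ABS (ergative)":       "AbsolutiveCase",
}

def resolve(local_term: str) -> str:
    """Translate a project-specific gloss to its shared concept."""
    return TERM_MAP.get(local_term, "UNMAPPED")

print(resolve("NOM (Australianist)"))  # CaseOfIntransitiveSubject
print(resolve("NOM (Indo-European)"))  # NominativeCase
```

With such a mapping published alongside the data, two datasets that gloss "nominative" differently remain comparable: a tool resolves each local term to the shared concept before searching or comparing.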
There are a variety of transcription and annotation software options to choose from, and many criteria to consider in the decision. Linguists must think beyond their immediate purposes for the data and use software that will facilitate diverse use of the data in the future. However, software should also cater to the immediate goals of the linguist.