Steven Moran & Shauna Eggers, University of Washington

APEC: A tool for automatic phonological analysis of field data

The impending extinction of half of the world's languages within the next one hundred years is a well-known crisis of central concern to linguistics. To foster the rapid documentation of lesser-known languages, this paper will demonstrate APEC (Automated Phonetic Environment Creator), a tool developed by documentary and computational linguists for automatic phonological data analysis. APEC takes digitized field data as input and outputs the phone environments of the linguistic forms together with their occurrence frequencies. Additionally, the tool automatically finds minimal pairs, suggests hypotheses for phonemes and allophones, and supports interactive exploration of the data through a "pattern-matching" option based on either IPA symbols or distinctive features provided by the Phonetics Ontology. The results of the analysis can be output in a variety of formats for import into other software tools, such as MS Excel.
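To make the notions of phone environment and minimal pair concrete, the following is a minimal Python sketch and not APEC's actual implementation. It assumes each form has already been segmented into a tuple of phones; the function names, the `#` word-boundary symbol, and the toy forms are hypothetical and are not drawn from the Western Sisaala data.

```python
from collections import Counter
from itertools import combinations


def phone_environments(words):
    """Count (left, phone, right) contexts; '#' marks a word boundary (assumed symbol)."""
    counts = Counter()
    for phones in words:
        padded = ["#", *phones, "#"]
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i], padded[i + 1])] += 1
    return counts


def minimal_pairs(words):
    """Return pairs of equal-length forms that differ in exactly one segment."""
    pairs = []
    for a, b in combinations(words, 2):
        if len(a) != len(b):
            continue
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            pairs.append((a, b, diffs[0]))
    return pairs


# Invented toy forms for illustration only.
toy_words = [("p", "a", "l"), ("b", "a", "l"), ("p", "a", "n"), ("t", "i")]

environments = phone_environments(toy_words)
pairs = minimal_pairs(toy_words)
```

In this sketch, `environments` maps each context to its frequency, and each entry of `pairs` records the two forms along with the contrasting segments, which is the kind of evidence a researcher would inspect when proposing phonemes.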

APEC is useful for the rapid development of phonemic inventories, and has been tested on the previously undocumented language Western Sisaala (Niger-Congo; Gur). Our aim is for tools like APEC to automate laborious tasks for the field researcher, while also providing insight for linguistic analyses. In the case of Western Sisaala, weeks were spent in the field collecting data and developing a phonemic inventory by charting out phonetic environments and analyzing minimal pairs. We believe that APEC will automate much of this process, allowing researchers more time to collect data and test hypotheses while in the field. Currently, APEC's analyses are being compared to the Western Sisaala researcher's manually derived hypotheses. This evaluation metric will be discussed and the details of our findings presented.

We will also discuss the computational modeling of linguistic problems that we encountered in APEC's development. For example, we had to make the tool robust to variation in field linguists' transcriptions, e.g. some linguists transcribe long vowels as a:, others as aa. We also faced the problem of distinguishing between consonant sequences that represent a single complex sound, e.g. the voiced labial-velar gb (found in many West African languages), and consonants that merely happen to be adjacent to each other across a syllable boundary, e.g. lag'ba (see note 1). Furthermore, we will discuss the challenges of adapting the Phonetics Ontology for use in tools for field linguists.
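The two transcription problems above can be illustrated with a short Python sketch. This is one possible approach under stated assumptions, not a description of how APEC handles them: the vowel set, the inventory of complex segments (`gb`, `kp`), and the treatment of the apostrophe as a syllable-boundary mark are all assumptions made for the example.

```python
VOWELS = set("aeiou")   # assumed toy vowel set
COMPLEX = ("gb", "kp")  # assumed inventory of complex segments
BOUNDARY = "'"          # assumed syllable-boundary mark


def normalize_length(form):
    """Rewrite a doubled vowel (aa) as vowel plus length mark (a:)."""
    out = []
    for ch in form:
        if out and ch in VOWELS and out[-1] == ch:
            out.append(":")
        else:
            out.append(ch)
    return "".join(out)


def segment(form):
    """Split a form into segments, keeping gb/kp together unless a boundary intervenes."""
    segments = []
    i = 0
    while i < len(form):
        if form[i] == BOUNDARY:
            i += 1  # drop the boundary; it only blocks digraph matching
        elif form[i:i + 2] in COMPLEX:
            segments.append(form[i:i + 2])
            i += 2
        elif form[i] == ":" and segments:
            segments[-1] += ":"  # attach the length mark to the preceding vowel
            i += 1
        else:
            segments.append(form[i])
            i += 1
    return segments


complex_onset = segment(normalize_length("gbaa"))
across_boundary = segment("lag'ba")
```

Here `gb` is kept as one segment in the first form but split in the second, because the boundary mark separates the two consonants. Without an explicit boundary in the transcription, this sketch would treat every `gb` as a complex segment, which is exactly the ambiguity described above.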

This paper will also address broader concerns in developing software for language documentation. These include criteria for good software design, making the code open source so that others can use and build upon it, and documenting the software's architecture and functionality so that it remains accessible in the future.


1. The example lag'ba does not occur in our data sample, but we had to encode the possibility of it occurring.