Biao Min Case Study:
From Notecards to the Web
- Image digitization
- OCR or keyboard entry?
- Text digitization
- Text storage
- Text presentation
- Metadata creation
- Follow the digitization path of the Biao Min data:
For four months in 1982, while at the Central Nationalities Institute on a Graduate Program Fellowship, David Solnit collected all the existing field data on Biao Min. His collaborator and informant was Mr. Deng Fanggui (鄧方貴), a research fellow affiliated with the institute. Mr. Deng, a native speaker of Biao Min, was 52 years old at the time and came from the village of swəl:7 lyɔŋ2 (雙龍) in Quanzhou County (全州), Guangxi Province. David Solnit wrote down the Biao Min data on notecards, one word per card. Shown below are three examples of David Solnit's actual notecards.
These notecards were then put aside, in a closet, for over a decade. In 2001, Dr. David Solnit donated them to the E-MELD project. With the assistance of Dr. Martha Ratliff, the graduate research assistants of the Linguist List have been digitizing the Biao Min notecards.
Because this is the only documentation of Biao Min known to exist, simply entering the information on the notecards into a database is not sufficient. In digitizing documentation of endangered languages for long-term preservation, researchers must ask themselves, "Will future linguists find it valuable to have access to images of the notes themselves?" There are cases in which marks in field notes have led later analysts to a reinterpretation of the data. For this reason, it is important to create and archive digital images of the notecards.
The process of image digitization requires making decisions on scanning parameters and storage formats. Three different types of image formats may be created in the process of digitization: the archival format (master copy), the presentation or access format, and a thumbnail version of the image.
The Biao Min notecards are being scanned with the following settings:
- Bit depth: 8-bit grayscale. We chose not to use color (24-bit or higher), because the cards are written in pencil and ink, in shades of gray (continuous tone) but without any color differences (the cards themselves are a faded buff shade, but the color of the cards is not important information for our needs).
- Resolution: 400 dpi (dots per inch), in order to capture the smallest significant strokes on the cards. Ideally, we might have used 600 dpi, or even higher, but space limitations forced us to choose the setting that provided sufficient clarity without using an unreasonable amount of storage.
- Contrast: The scanner was set to 50% contrast, in order to improve the quality of the image. We found that increasing the contrast made a greater difference in legibility than using a higher resolution would have done.
These decisions were based on the importance of clearly understanding the intellectual content of what was written on the cards, to serve the goal of preserving the lexicon of an endangered language. If we had been scanning other materials, such as artwork, or early versions of a rare and unusual orthography, or index cards annotated in differently colored inks, we might have chosen color and a higher resolution. Conversely, when scanning printed or typed text, a lower resolution and bitonal (1-bit) images may suffice. Scanning parameters need to vary according to the characteristics of the objects being scanned.
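The trade-off between bit depth, resolution, and storage described above can be estimated with simple arithmetic. The sketch below is illustrative only: it uses the pixel dimensions reported for one TIFF image further down this page (1480 x 876) and assumes they correspond to a 400 dpi scan. The function names are hypothetical, and file headers and compression are ignored.

```python
def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Raw image size in bytes, ignoring file headers and compression."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


def rescale_pixels(pixels: int, from_dpi: int, to_dpi: int) -> int:
    """Pixel count along one edge if the same card were scanned at another dpi."""
    return round(pixels * to_dpi / from_dpi)


# Assumption: these dimensions come from a 400 dpi scan of one notecard.
WIDTH_400, HEIGHT_400 = 1480, 876

gray_400 = uncompressed_bytes(WIDTH_400, HEIGHT_400, 8)      # 8-bit grayscale
color_400 = uncompressed_bytes(WIDTH_400, HEIGHT_400, 24)    # 24-bit color
bitonal_400 = uncompressed_bytes(WIDTH_400, HEIGHT_400, 1)   # 1-bit bitonal

# The same card at 600 dpi has 1.5x the pixels along each edge.
width_600 = rescale_pixels(WIDTH_400, 400, 600)
height_600 = rescale_pixels(HEIGHT_400, 400, 600)
gray_600 = uncompressed_bytes(width_600, height_600, 8)

for label, size in [
    ("400 dpi, 1-bit", bitonal_400),
    ("400 dpi, 8-bit gray", gray_400),
    ("400 dpi, 24-bit color", color_400),
    ("600 dpi, 8-bit gray", gray_600),
]:
    print(f"{label:<24}{size / 1_000_000:6.2f} MB")
```

Under these assumptions, moving from 400 to 600 dpi multiplies the raw size by 2.25, and moving from grayscale to 24-bit color multiplies it by 3, which is why both were weighed against available storage.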
The resulting master images are being archived in TIFF format. TIFF is well suited to archival storage because it can hold image data either uncompressed or with lossless compression, so no information is lost.
TIFF images are extremely large (approximately one megabyte per image). Large files take a long time to download to a PC, making them impractical for presentation on the web. Therefore, copies of the Biao Min notecard images are being compressed in GIF format for display purposes. GIF uses a lossless compression algorithm and a limited (8-bit) color palette to reduce file size. The image details below compare the two formats for a single notecard.
1) GIF image of a Biao Min notecard
Image Details: Width: 1484 pixels; Height: 878 pixels; Bit Depth: 8 bits per pixel; Color Representation: Palettized; Compression: Lempel-Ziv; Size: 580 KB
2) TIFF image of a Biao Min notecard
Image Details: Width: 1480 pixels; Height: 876 pixels; Bit Depth: 8 bits per pixel; Color Representation: Palettized; Compression: Lempel-Ziv; Size: 614 KB
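What "lossless" means for these formats can be shown with a small round trip. The sketch below is not the Lempel-Ziv-Welch scheme used inside GIF and TIFF files: it substitutes the DEFLATE compressor from Python's standard `zlib` module, which is also lossless, and it runs on a synthetic grayscale buffer rather than a real scan. A real notecard image would compress far less than this artificial one.

```python
import zlib

WIDTH, HEIGHT = 1480, 876  # pixel dimensions reported for the TIFF image

# Synthetic 8-bit grayscale "card": a light background (value 200)
# with one dark horizontal stroke (value 40) ten pixels tall.
row_blank = bytes([200]) * WIDTH
row_stroke = bytes([200]) * 400 + bytes([40]) * 680 + bytes([200]) * 400
pixels = b"".join(
    row_stroke if 430 <= y < 440 else row_blank for y in range(HEIGHT)
)

compressed = zlib.compress(pixels, 9)   # 9 = strongest compression level
restored = zlib.decompress(compressed)

print("raw bytes:       ", len(pixels))
print("compressed bytes:", len(compressed))
print("identical after round trip:", restored == pixels)
```

Because the restored bytes are identical to the original, every pixel value survives compression, which is the property that matters for an archival master.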
Thumbnails are frequently created for presentation. Thumbnails are usually GIF images that have been reduced in size, making it possible to display small, clickable versions of images on a single webpage. They are especially useful for accessing images that are difficult to describe in words, such as artwork or photographs. However, the most logical way to access the Biao Min notecards will be by linking them to the entries in the FIELD lexical database discussed below. Therefore, although we have created thumbnails for a few Biao Min notecards to be viewed on these pages, it is unlikely that we will produce thumbnails for each of the thousands of cards in the lexicon.
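Reducing an image to thumbnail size means scaling both edges by the same factor so that the card keeps its proportions. The following minimal sketch computes the target dimensions only; the function name and the 150-pixel limit are assumptions chosen for illustration, not settings used by the project.

```python
def thumbnail_size(width_px: int, height_px: int, max_edge: int) -> tuple[int, int]:
    """Scale (width, height) to fit in a max_edge square, keeping the aspect ratio.

    Images already smaller than max_edge are left at their original size.
    """
    scale = min(max_edge / width_px, max_edge / height_px, 1.0)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


# GIF access copy of a notecard, using the dimensions listed above.
print(thumbnail_size(1484, 878, 150))
```

An imaging library would then resample the pixels to these dimensions; that step is omitted here.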
Creating digital images of the cards was the first step; the second was finding a way to preserve the textual information. There are two ways to digitize text: type it in, or run an OCR (optical character recognition) application to convert the images into characters. OCR works relatively well for printed or typed text; unfortunately, it is not yet available for handwritten notes. Therefore, the Biao Min lexicon had to be keyed into some sort of database.
Since OCR is not suitable for handwritten documents, the research assistants of Linguist List began the time-consuming task of manually entering all of the data from David Solnit's notecards into a database using the FIELD tool. FIELD has been developed specifically for entry of lexical data in best practice format; it is Unicode-compliant, and has the ability to output the data as an XML document.
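Unicode compliance matters here because the transcriptions contain phonetic symbols outside the ASCII range. As a rough illustration (not part of the FIELD tool), the sketch below takes the village name quoted earlier on this page, lists its non-ASCII characters, and checks that it survives a round trip through UTF-8.

```python
import unicodedata

# Transcribed form quoted earlier on this page (a village name).
form = "swəl:7 lyɔŋ2"

# Encode to UTF-8 bytes for storage, then decode again.
encoded = form.encode("utf-8")
decoded = encoded.decode("utf-8")

# Characters that a plain ASCII database field could not hold.
non_ascii = [ch for ch in form if ord(ch) > 127]
for ch in non_ascii:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

print("characters:", len(form), "| UTF-8 bytes:", len(encoded))
print("round trip preserved the form:", decoded == form)
```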
XML stands for eXtensible Markup Language. It defines a standard way of encoding the structure of information in plain text format. It is an open standard of the World Wide Web Consortium that is based on extensible tags (extensible meaning that they are not pre-programmed, but can be defined by the creator). XML is currently considered best practice for the archival encoding of textual data, because it does not depend upon any particular software, and can be formatted through an XSL Stylesheet to be displayed in almost any format. Furthermore, it is generally more self-descriptive than other electronic formats, which should make it more accessible to future generations.
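To make the idea of self-describing, creator-defined tags concrete, here is a small sketch that builds one lexicon entry as XML with Python's standard library. The element and attribute names (`lexicon`, `entry`, `form`, `note`, `image`, `script`) and the file name are hypothetical; they are not the schema that FIELD actually outputs. The sample content is the village name mentioned earlier on this page, used purely as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure for a single notecard entry.
entry = ET.Element("entry", id="card-0001")
ET.SubElement(entry, "form", script="ipa").text = "swəl:7 lyɔŋ2"
ET.SubElement(entry, "form", script="hant").text = "雙龍"
ET.SubElement(entry, "note").text = "village name, Quanzhou County, Guangxi"
ET.SubElement(entry, "image", href="card-0001.tif")  # link to the scanned card

lexicon = ET.Element("lexicon", language="Biao Min")
lexicon.append(entry)

# Serialize to plain text; any XML-aware tool can read this back.
xml_text = ET.tostring(lexicon, encoding="unicode")
print(xml_text)

# Parsing the text again recovers the same structure.
reparsed = ET.fromstring(xml_text)
ipa_form = reparsed.find("entry/form[@script='ipa']").text
```

Because the tags name what each piece of data is, a future reader can interpret the file without the software that produced it.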
Stylesheets can be used to transform XML documents into different file formats (for instance, HTML, text, or PDF). Using an XSL processor, it is possible to transform an XML document via multiple XSL stylesheets which will display the information in multiple formats without changing the original XML document. Thus, a stylesheet could transform the same lexicon in XML into a learner's dictionary or an academic dictionary, in online or printed versions.
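The principle of one source and many presentations can be sketched without an XSL processor. The example below is a stand-in for XSLT, not XSLT itself: two ordinary Python functions play the role of two stylesheets, each producing a different view of the same XML while leaving the source untouched. The element names and the sample entry are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical lexicon source; element names are illustrative only.
LEXICON_XML = """<lexicon language="Biao Min">
  <entry id="card-0001">
    <form>swəl:7 lyɔŋ2</form>
    <gloss>village name (雙龍)</gloss>
  </entry>
</lexicon>"""


def to_html(root: ET.Element) -> str:
    """Render entries as an HTML list (an online presentation)."""
    items = [
        f"<li><b>{escape(e.findtext('form'))}</b> {escape(e.findtext('gloss'))}</li>"
        for e in root.iter("entry")
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def to_text(root: ET.Element) -> str:
    """Render entries as tab-separated plain text (a print-oriented export)."""
    return "\n".join(
        f"{e.findtext('form')}\t{e.findtext('gloss')}" for e in root.iter("entry")
    )


root = ET.fromstring(LEXICON_XML)
html_view = to_html(root)
text_view = to_text(root)
print(html_view)
print(text_view)
```

With a real XSL processor, each function would instead be a separate stylesheet applied to the same XML document.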
Metadata is information about resources. In this case, it is information about language resources: lexicons, audiotapes, transcribed texts, language descriptions, video recordings, etc. It is similar to card catalog information about library resources -- it enables discovery and retrieval of resources through standardized information.
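A metadata record of this catalog-card kind can be sketched as a handful of standardized fields. The example below assumes Dublin Core element names (title, creator, date, and so on) as one widely used vocabulary; the text above does not say which metadata standard the project adopted. The `record` wrapper element and the title and subject wording are hypothetical, while the creator, contributor, and date come from the account earlier on this page.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def make_record(fields: list[tuple[str, str]]) -> ET.Element:
    """Build a flat metadata record from (element name, value) pairs."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


record = make_record([
    ("title", "Biao Min notecards"),      # hypothetical title
    ("creator", "David Solnit"),
    ("contributor", "Deng Fanggui"),
    ("subject", "Biao Min lexicon"),      # hypothetical subject wording
    ("date", "1982"),
    ("type", "Image"),
    ("format", "image/tiff"),
])

print(ET.tostring(record, encoding="unicode"))
```

Because every archive that uses the same element names describes its holdings in the same way, a search service can find this resource alongside others without knowing anything about how it was created.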
- Get Started: Summary of Biao-Min Conversion
- Digitize Images: Digitizing Images page (Classroom)
- OCR or Keyboard Entry: OCR or Keyboard page (Classroom)
- Digitize Text: Lexical Analysis page (Workroom)
- Store Text: XML page (Classroom)
- Present Text: Stylesheets page (Classroom)
- Create Metadata: Metadata page (Classroom)