Language Technology for Languages of the Central African Republic

A small project to make use of data on languages of the Central African Republic.

Duration: Oct 2007-

Participant: Harald Hammarström, PhD Student

This is an ongoing project to retrieve (from diskettes):

  • A Sango corpus
  • A Mpiemo text collection and dictionary

Then the resources will be used in the following manner:

  • POS induction and morphology induction algorithms will be run on the Sango and Mpiemo data, respectively
  • The corpora will be converted to suitable formats such as TEI XML for use in ITG (a pedagogical tool for teaching grammar based on tagged corpora)
  • A website will be built with general information on the languages of the Central African Republic, and the resources will be made available there
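
The conversion step could look roughly like the following. This is only a sketch under assumptions: the input is taken to be sentences of (token, tag) pairs, the element names `s` and `w` and the `pos` attribute follow common TEI practice but the output is not validated against a TEI schema, and the tokens shown are placeholders rather than real Sango or Mpiemo data.

```python
import xml.etree.ElementTree as ET


def sentences_to_tei(sentences):
    """Wrap POS-tagged sentences in a minimal TEI-style <text> tree.

    `sentences` is a list of sentences, each a list of (token, tag) pairs.
    """
    text = ET.Element("text")
    body = ET.SubElement(text, "body")
    paragraph = ET.SubElement(body, "p")
    for number, sentence in enumerate(sentences, start=1):
        s = ET.SubElement(paragraph, "s", {"n": str(number)})
        for token, tag in sentence:
            w = ET.SubElement(s, "w", {"pos": tag})
            w.text = token
    return text


# Placeholder tokens and tags, not real corpus data.
sample = [[("tok1", "N"), ("tok2", "V")], [("tok3", "N")]]
tei = sentences_to_tei(sample)
tei_string = ET.tostring(tei, encoding="unicode")
print(tei_string)
```

A real conversion would also need a TEI header with metadata about the source of each text.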

Graphical Functional Morphology (A Small CLT Project)

Graphical Functional Morphology (GFM) is a web-based prototype system for lexical resource development.

Duration: Sep-Oct 2007

Participant: Markus Forsberg, Doctor

GFM is based on the ideas of Functional Morphology.

GFM includes a type editor, a paradigm editor, and a dictionary editor, together with a generator that supports compilation to a full-form lexicon. Compilation to an XML format is also supported; the XML can later be read back into the system, which provides data persistence.
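
To illustrate the idea of compiling a dictionary to a full-form lexicon, here is a minimal sketch. It does not reflect GFM's actual implementation or XML format: the paradigm names, the English example words, and the `dictionary`/`entry` element names are all assumptions made for the illustration. A paradigm is modelled as a function from a stem to an inflection table, and the dictionary as a list of (stem, paradigm name) pairs.

```python
import xml.etree.ElementTree as ET


# A paradigm maps a stem to an inflection table: {features: word form}.
def regular_noun(stem):
    return {"sg": stem, "pl": stem + "s"}


def noun_in_y(stem):
    # Hypothetical second paradigm: "pony" -> "ponies".
    return {"sg": stem, "pl": stem[:-1] + "ies"}


PARADIGMS = {"regular_noun": regular_noun, "noun_in_y": noun_in_y}


def compile_full_form_lexicon(dictionary):
    """Expand (stem, paradigm name) entries into (form, stem, features) rows."""
    rows = []
    for stem, paradigm_name in dictionary:
        table = PARADIGMS[paradigm_name](stem)
        for features, form in table.items():
            rows.append((form, stem, features))
    return rows


def dictionary_to_xml(dictionary):
    """Serialize the dictionary entries (not the expanded forms) to XML."""
    root = ET.Element("dictionary")
    for stem, paradigm_name in dictionary:
        ET.SubElement(root, "entry", {"stem": stem, "paradigm": paradigm_name})
    return ET.tostring(root, encoding="unicode")


def dictionary_from_xml(xml_text):
    """Read the entries back, giving persistence across sessions."""
    root = ET.fromstring(xml_text)
    return [(e.get("stem"), e.get("paradigm")) for e in root.findall("entry")]


dictionary = [("cat", "regular_noun"), ("pony", "noun_in_y")]
lexicon = compile_full_form_lexicon(dictionary)
restored = dictionary_from_xml(dictionary_to_xml(dictionary))
```

Only the compact dictionary is stored in this sketch; the full-form lexicon can be regenerated from it at any time, so the XML stays small even for rich paradigms.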

The system is available at: Graphical Functional Morphology. It was developed using Java and Google Web Toolkit (GWT), and consists of 3.3k lines of source code divided into 50 classes, later compiled with GWT into 8k lines of JavaScript source code. The source code is available on request. A quick start document is also available at the web page, together with an example.

Future work

This project allowed the basic infrastructure of the program to be defined, but more work is needed. A non-exhaustive list of possible future work is given here, in no particular order.

  • Enrichment of the dictionary format.
  • Collaborative lexicon development support.
  • Extended regular expression support to enable more fine-grained definitions of the stems.
  • Compact representations of paradigms. This could be looked at from two angles:
    • The user is allowed to define a compact representation through graphical navigation of the type tree.
    • Perhaps more interestingly, the system automatically finds a minimal representation of a paradigm definition. The intuition of the writer, however, is that this problem is NP-complete. If so, a heuristic would still be very valuable, not only for paradigm definitions but for case expressions in general, since it would allow more time- and space-efficient code (this problem may already be addressed in the compiler construction literature).
  • Plugging Extract into the software, allowing a graphical definition of lexicon extraction rules.
  • Adding compound types and description of compounds.
  • Addressing the problem of stability when paradigms and types are changed or deleted.
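
As an illustration of the kind of heuristic mentioned under compact representations, the sketch below greedily merges branches of a paradigm table that agree across every value of one feature. It is not part of GFM: the table layout, the wildcard symbol, and the feature names (`sg`, `pl`, `nom`, `gen`) are assumptions for the example, and the result is not guaranteed to be minimal.

```python
WILDCARD = "_"


def compact(table, domains):
    """Greedily merge branches that agree across every value of one feature.

    `table` maps feature tuples to results; `domains` lists the possible
    values for each feature position. Returns a sorted list of
    (pattern, result) pairs in which WILDCARD matches any value.
    """
    branches = dict(table)
    changed = True
    while changed:
        changed = False
        for position, values in enumerate(domains):
            for pattern in list(branches):
                # Skip branches already merged away or already generalized here.
                if pattern not in branches or pattern[position] == WILDCARD:
                    continue
                siblings = [
                    pattern[:position] + (value,) + pattern[position + 1:]
                    for value in values
                ]
                result = branches[pattern]
                if all(branches.get(sibling) == result for sibling in siblings):
                    for sibling in siblings:
                        del branches[sibling]
                    merged = pattern[:position] + (WILDCARD,) + pattern[position + 1:]
                    branches[merged] = result
                    changed = True
    return sorted(branches.items())


# Hypothetical suffix table indexed by (number, case).
domains = [("sg", "pl"), ("nom", "gen")]
table = {
    ("sg", "nom"): "",
    ("sg", "gen"): "s",
    ("pl", "nom"): "s",
    ("pl", "gen"): "s",
}
compacted = compact(table, domains)
uniform = compact({key: "s" for key in table}, domains)
```

Here the four branches of `table` shrink to three, while a table with the same result everywhere collapses to a single all-wildcard branch. Allowing ordered, overlapping patterns with a default branch could do better on `table`, which is where the search for a truly minimal representation becomes harder.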

The DICO Project

The project aims to demonstrate how state-of-the-art spoken language technology can enable access to communication, entertainment and information services, as well as to environment control, in vehicles.