

Generation of Multilingual Wikipedia articles from raw data


In this project we will collect raw data about countries, cities, languages, etc., and use this data to generate Wikipedia-style articles describing the different entities. The natural language generation will be done using the GF framework, and must be done in at least two different languages to demonstrate that the technology is multilingual. The current proposal is to cover the geographical domain, but applications in other domains are possible as well. Suggestions for different domains can be discussed with the supervisor.


Contact and possible supervisor: Krasimir Angelov.

GF Resource Grammar for a yet unimplemented language


See http://www.grammaticalframework.org/lib/doc/status.html for the current status. This is an ambitious project and hard work, requiring good knowledge of GF and the target language. But the result will be a permanent contribution to language resources, and almost certainly publishable. For instance, the Japanese grammar implemented within this Master's programme was published in JapTAL.


Contact and possible supervisor: Krasimir Angelov.

Simplypedia: Accessible reading of Wikipedia with supportive AAC symbols

The background

Studies have shown that up to 25% of the population sometimes have difficulties reading ordinary informational text, and 7–8% of the population have reading difficulties so severe that easy-to-read material is their only way of accessing texts. (http://www.lattlast.se/om-oss)

The group of people with reading difficulties is very diverse: they may be dyslexic, deaf, elderly, immigrants or school children. They may also have cognitive disabilities, such as autism, aphasia, dementia, or intellectual disabilities. Research has shown that this latter group in particular can be helped by supportive AAC symbols, such as Blissymbolics (http://www.blissymbolics.org).

There is an easy-to-read variant of the English Wikipedia, called the "Simple English Wikipedia" (http://simple.wikipedia.org). Its articles are not automatically simplified from the standard Wikipedia; they are written and crowd-sourced manually, like all other Wikipedias.

The project

In this project you will enhance the Simple English Wikipedia with supportive AAC symbols. You will create a tool (either a server-based web service, or a stand-alone program, or an auto-generated web site) which mirrors Wikipedia, but where the text is enhanced by AAC symbols.

The tool should read the text of a given Wikipedia article and process it to decide which symbols to attach, and where. The text must be sentence-segmented, POS-tagged and parsed to identify the main verbs, other important words, and grammatical relations. The tool can then look up each word, or a synonym or hypo-/hypernym of it, in a symbol dictionary. The resulting text should then be displayed to the user.
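The lookup-with-fallback step can be sketched as follows. This is a minimal illustration only: the symbol dictionary and hypernym table below are made-up stand-ins (a real tool would use an actual AAC symbol set and, e.g., WordNet hypernyms via NLTK), and the tokenization is a plain whitespace split rather than proper segmentation, tagging and parsing.

```python
# Hypothetical symbol dictionary: word -> symbol image file.
SYMBOLS = {"dog": "dog.svg", "animal": "animal.svg", "run": "run.svg"}

# Hypothetical hypernym table, standing in for WordNet lookups.
HYPERNYMS = {"poodle": "dog", "dog": "animal"}

def find_symbol(word):
    """Look the word up directly; fall back to ever more general
    hypernyms until a symbol is found or the chain runs out."""
    while word is not None:
        if word in SYMBOLS:
            return SYMBOLS[word]
        word = HYPERNYMS.get(word)
    return None

def annotate(sentence):
    # A real tool would sentence-segment, POS-tag and parse first
    # (e.g. with NLTK); here we just whitespace-tokenize.
    return [(w, find_symbol(w.lower())) for w in sentence.split()]
```

The output of annotate, a list of (word, symbol-or-None) pairs, is what the display component would then render as text with symbols above or beside the matched words.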

There are lots of extra things that can be added. One example is to use a Named Entity Extraction module to find names, and then try to match them with pictures or logos. Another example is to make use of the Wikipedia links to find a suitable picture that can be shown together with the link.

The qualifications

You will only use existing NLP components for tagging, parsing etc. But there will be a lot of programming to make the components work well together. NLTK can probably be used a lot, but there will certainly be things that you have to solve outside of NLTK.

The project involves script programming and web programming, so you need to know your Python and a good deal of HTML. Javascript is a plus but not necessary, depending on what kind of tool you want to build.

The future

If you work hard and make a good project, you stand a good chance of getting a paper published in the International Workshop on Speech and Language Processing for Assistive Technologies (http://www.slpat.org/slpat2013).

The supervisor

Peter Ljunglöf, Department of Computer Science and Engineering (Data- och informationsteknik).


High-quality translation of web pages in GF


The idea is to build a grammar for some standardized kind of web pages, for instance personal CVs, which permits high-quality translation. This project has the potential to be carried out with industrial partners, who might also help define and evaluate the results. There is publication potential, but also commercial potential in this task.


Contact and possible supervisor: Aarne Ranta.

Creating a large-scale translation lexicon for GF


This work uses resources like WordNet and Wiktionary to create a translation dictionary usable for open-text translation. The size should be at least 10k lemmas. Similar work has recently been carried out for e.g. Hindi, Finnish, German, and Bulgarian. Like the previous task, this is an ambitious project with the potential to lead to a publication.


Contact and possible supervisor: Aarne Ranta.

GF Resource Grammar for a yet unimplemented language


See http://www.grammaticalframework.org/lib/doc/status.html for the current status. This is an ambitious project and hard work, requiring good knowledge of GF and the target language. But the result will be a permanent contribution to language resources, and almost certainly publishable. For instance, the Japanese grammar implemented within this Master's programme was published in JapTAL.


Contact and possible supervisor: Aarne Ranta.

Improving Accuracy and Efficiency of Automatic Software Localization

This project involves working closely with industry to improve accuracy of automatic translation of software interfaces and documentation by exploiting context specificity.

For instance, software source code can be mined to glean the appropriate context for individual messages (e.g., to distinguish a button from an error message). The student(s) will incorporate GF (Grammatical Framework, www.molto-project.eu) grammars into a hybrid translation system that improves on CA Labs' current statistical- and Translation Memory-based methods. User interfaces will enjoy more accurate automatic translations, and error/feedback messages will no longer be generic, but will be adapted to the user's specific interaction scenario. The first goals are to deliver high-quality translations for the most commonly used languages/dialects and to develop an infrastructure to quickly produce acceptable-quality results for new languages. Follow-on work will optimize the translation engine for performance (thereby enabling fast, off-line translation of very large corpora of documents/artifacts).

This project not only involves working closely with researchers and linguists/language experts at CA Labs, but also includes a collaboration with faculty and students at the Universitat Politecnica de Catalunya. Opportunities for either short research visits or longer internships at CA Labs are very good.

S.A. McKee, A. Ranta (Chalmers/GU), V. Montés, P. Paladini (CA Labs Barcelona)

Robust Parsing using Minimum Edit Distance

Assume that you have a grammar for parsing utterances into some kind of semantic interpretation, such as voice commands or dialogue acts. Can you use this grammar to handle non-grammatical utterances as well?

A possible example application is a dialogue system for an Android device. Android has built-in speech recognition, but the ASR engine is not grammar-driven, so it returns a sentence which need not be grammatical.

The idea of this project is to use a Minimum Edit Distance metric, such as the Levenshtein Distance, and try to find the closest sentence that is accepted by the grammar.

Suggested workflow:

1. Start with a grammar G and a test set T of sentences with associated semantics.

2. Generate all possible sentences from the grammar G (up to a given maximum length), with associated semantics. This will give a corpus C, a semantic treebank.

3. For each sentence S in the test set T, find the closest sentence S' that is in C.

4. Evaluate each sentence S by comparing its intended semantics T(S) with the semantics C(S') of the grammatical sentence S'. This will give a Semantic Error Rate (SER).

5. Do this for different possible distance metrics.

The different distance metrics can be
- Levenshtein distance on word level
- Levenshtein distance on character level
- possibly the distance on phoneme or morpheme level
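The core matching step, finding the closest sentence in the generated corpus, can be sketched in Python. This is a naive linear scan with a standard dynamic-programming Levenshtein distance; the same function works at word level (lists of tokens) and character level (strings), covering the first two metrics above.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two sequences;
    # a and b can be strings (character level) or token lists (word level).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution
        prev = cur
    return prev[-1]

def closest_sentence(sentence, corpus):
    """Return the corpus sentence with minimum word-level edit distance."""
    words = sentence.split()
    return min(corpus, key=lambda s: levenshtein(words, s.split()))
```

A brute-force scan over the whole corpus C is fine for small grammars; for larger ones, part of the project could be to speed this up, e.g. by running the edit-distance search over the grammar's state space instead of an enumerated corpus.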

Supervisor: Peter Ljunglöf

Free Dyslexia Software

There are many (more or less successful) attempts at using software to help people with dyslexia.

This project will give an overview of existing software and then develop and evaluate new software.

The supervisor will be Bengt Nordström, Dept of Computer Science and Engineering at Chalmers. He has good contacts with teachers specializing in assisting dyslexic students. The software developed in this work will be licensed under GPL or some similar scheme.

Requirements: either a solid background in Language Technology with good programming skills, or a solid background in Computer Science with a demonstrated interest in Language Technology.

MOLTO - Multilingual On-line Translation

MOLTO's goal is to develop a set of tools for translating texts between multiple languages in real time with high quality. Languages are separate modules in the tool and can be varied; prototypes covering a majority of the EU's 23 official languages will be built.

As its main technique, MOLTO uses domain-specific semantic grammars and ontology-based interlinguas. These components are implemented in GF (Grammatical Framework), which is a grammar formalism where multiple languages are related by a common abstract syntax. GF has been applied in several small-to-medium size domains, typically targeting up to ten languages but MOLTO will scale this up in terms of productivity and applicability.
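The idea of a common abstract syntax can be illustrated by a small grammar in the style of the GF tutorial's "Hello world" example. The abstract syntax defines language-independent trees; each concrete syntax maps them to strings, so the same tree linearizes to "hello world" in English and "hej världen" in Swedish. The Swedish module below is an adapted sketch, not taken verbatim from the tutorial.

```
abstract Hello = {
  flags startcat = Greeting ;
  cat Greeting ; Recipient ;
  fun
    Hello : Recipient -> Greeting ;
    World, Friends : Recipient ;
}

concrete HelloEng of Hello = {
  lincat Greeting, Recipient = {s : Str} ;
  lin
    Hello recip = {s = "hello" ++ recip.s} ;
    World   = {s = "world"} ;
    Friends = {s = "friends"} ;
}

concrete HelloSwe of Hello = {
  lincat Greeting, Recipient = {s : Str} ;
  lin
    Hello recip = {s = "hej" ++ recip.s} ;
    World   = {s = "världen"} ;
    Friends = {s = "vänner"} ;
}
```

Translation then amounts to parsing with one concrete syntax and linearizing the resulting tree with another; MOLTO scales this scheme up to larger domains and many more languages.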

A part of the scale-up is to increase the size of domains and the number of languages. A more substantial part is to make the technology accessible for domain experts without GF expertise and minimize the effort needed for building a translator. Ideally, this can be done by just extending a lexicon and writing a set of example sentences.

The most research-intensive parts of MOLTO are the two-way interoperability between ontology standards (OWL) and GF grammars, and the extension of rule-based translation by statistical methods. The OWL-GF interoperability will enable multilingual natural-language-based interaction with machine-readable knowledge. The statistical methods will add robustness to the system when desired. New methods will be developed for combining GF grammars with statistical translation, to the benefit of both.

MOLTO technology will be released as open-source libraries which can be plugged in to standard translation tools and web pages and thereby fit into standard workflows. It will be demonstrated in web-based demos and applied in three case studies: mathematical exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.

MOLTO is funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement FP7-ICT-247914.

Follow MOLTO on Twitter at http://twitter.com/moltoproject.