
Classification of learner essays by achieved proficiency level


Developing an algorithm (web services) for the automatic classification of Swedish learner essays by the proficiency level they have reached.


The suggested approach is to use machine learning for essay classification. The challenge is to identify features that are both informed by Second Language Acquisition (SLA) research and informative for the task at hand.

The classification will be made in terms of the levels of proficiency according to the Common European Framework of Reference (CEFR), which covers six learner levels: A1 (beginner), A2, B1, B2, C1, C2 (near-native). At the moment we have electronic corpora of essays at levels B1, B2, and C1. Essays at A2 are hand-written and haven't yet been digitized and annotated (which presumably can be done in time for the project, if someone picks this topic).

Problem description

The steps for this project would include:

  • background reading on the topics of SLA, CEFR, essay grading and learner essay classification by level. See one example of Swedish essay grading (NOT in terms of levels, but in terms of grades, i.e. (Väl/Icke) Godkänd): http://www.ling.su.se/english/nlp/tools/automated-essay-scoring
  • testing approaches for the best-performing classification
  • implementation of web service(s) for learner essay classification
  • (potentially) implementation of Lärka-based user interface where new essays can be tested
  • (potentially) evaluation of the results with teachers & new essays
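
The classification step could be sketched with two simple SLA-inspired features (mean sentence length and lexical diversity) and a nearest-centroid classifier. This is only a minimal illustration of the idea: the naive tokenization, the feature set, and the toy essays are placeholder assumptions, and a real project would use richer features and a proper machine-learning toolkit.

```python
import math
from collections import defaultdict

def features(essay):
    """Two simple SLA-inspired features: mean sentence length and
    type/token ratio (lexical diversity). Assumes pre-tokenized text."""
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    tokens = essay.lower().split()
    return (len(tokens) / len(sentences), len(set(tokens)) / len(tokens))

def train(labelled_essays):
    """Compute one feature centroid per CEFR level from (essay, level) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for essay, level in labelled_essays:
        f = features(essay)
        sums[level][0] += f[0]
        sums[level][1] += f[1]
        sums[level][2] += 1
    return {lvl: (a / n, b / n) for lvl, (a, b, n) in sums.items()}

def classify(essay, centroids):
    """Assign the CEFR level whose centroid is closest in feature space."""
    f = features(essay)
    return min(centroids, key=lambda lvl: math.dist(f, centroids[lvl]))
```

A web service would wrap `classify` behind an HTTP endpoint, with the centroids (or a stronger trained model) loaded at startup.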

Recommended skills:

  • Python
  • jQuery
  • interest in machine learning


Supervisors:

  • Elena Volodina/Ildiko Pilan
  • potentially others from Språkbanken/FLOV

OCR error correction and segmentation

Master thesis project: OCR error correction and segmentation

Goal: Creating better quality text by finding good methods for OCR error correction and text segmentation.

Background: Long-term textual archives, spanning decades or centuries, have the potential to answer many interesting research questions: one can look for changes in language and culture, author influence, and writing standards, among other things. One difficulty in working with historical documents is the quality of the data. Many long-term archives contain documents scanned at a time when OCR technology was far from perfect, and the physical quality of the documents themselves often leaves much to be desired. Because redoing the OCR manually would require a vast amount of time, it is not redone; instead, the errors should be corrected using rule-based systems and machine learning techniques. Beyond the OCR problems, many co-occurrence statistics require a definition of sentence or paragraph, which can itself be very challenging for texts from certain periods.

Are you interested in working with a large digitized textual archive and in developing techniques for correcting OCR errors in documents over 200 years old? Do you want to find ways to define sentences and paragraphs? Do you want to be part of a research team working on an exciting new research topic?

Problem description: Apply rule-based and statistical machine learning techniques to improve the quality of a large newspaper archive. The improvements will later be used by Språkbanken and made visible to the archive's users.
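
A classic baseline for the rule-based/statistical correction step is to match each out-of-lexicon word against a frequency lexicon by edit distance. The lexicon, the distance threshold, and the example words below are illustrative assumptions, not part of the project specification:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon, max_dist=2):
    """Return the most frequent lexicon word within max_dist edits,
    or the word unchanged if nothing is close enough."""
    if word in lexicon:
        return word
    # Prefer smaller distance, then higher corpus frequency.
    dist, _, best = min((edit_distance(word, w), -freq, w)
                        for w, freq in lexicon.items())
    return best if dist <= max_dist else word
```

A real system would add a channel model for typical OCR confusions (e.g. rn/m, l/i) instead of treating all edits as equally likely.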

Recommended skills:

  • Interest in rule-based systems and statistical machine learning techniques
  • Some mathematical background
  • Programming skills
  • A highly motivated 5th-year student

Supervisors: If you are interested or have questions, please contact: Nina Tahmasebi, Phone: 031-786 6953, Email: nina.tahmasebi@gu.se

Part-of-speech tagging/syntactic parsing of emergent texts


The goal of this project is to implement a part-of-speech tagger, and to investigate the possibilities of developing a syntactic parser, for emergent text, i.e. texts – or representations of texts – that are in the process of being produced (and thus frequently changed), in order to identify the syntactic location of, for example, pauses.


In research on language production, pauses and revisions are generally viewed as a window onto the underlying cognitive and linguistic processes. In research on written language production, these processes are captured by means of keystroke-logging programs that record all keystrokes and mouse movements together with their temporal distribution. These programs generate vast amounts of data, which are time-consuming to analyse manually. A part-of-speech tagger that could handle emergent text would therefore be of great importance for quantitative analyses of large language production corpora. Naturally, a syntactic parser would add even more value.

Problem description

To develop an HMM tagger for emergent texts (primarily in Swedish, but English texts could also be made available)

To investigate the possibilities of implementing a discourse-based incremental parser for emergent texts and if possible implement it.
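
The HMM tagger named above can be sketched as a standard Viterbi decoder. The toy tag set, the probabilities, and the smoothing constant below are illustrative assumptions; a real tagger would estimate the parameters from an annotated corpus and would additionally need to cope with the revisions that make emergent text special.

```python
def viterbi(tokens, tags, trans, emit, start):
    """Most likely tag sequence under a bigram HMM.
    trans[t1][t2], emit[t][w] and start[t] are probabilities;
    unseen events fall back to a small smoothing value."""
    EPS = 1e-6
    V = [{t: start.get(t, EPS) * emit[t].get(tokens[0], EPS) for t in tags}]
    back = []
    for w in tokens[1:]:
        row, ptr = {}, {}
        for t in tags:
            # Best previous tag for reaching tag t at this position.
            best_prev = max(tags, key=lambda p: V[-1][p] * trans[p].get(t, EPS))
            row[t] = V[-1][best_prev] * trans[best_prev].get(t, EPS) * emit[t].get(w, EPS)
            ptr[t] = best_prev
        V.append(row)
        back.append(ptr)
    # Follow the backpointers from the best final tag.
    last = max(tags, key=lambda t: V[-1][t])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```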

Recommended skills:

Good programming skills.


Supervisors: Richard Johansson and Åsa Wengelin

Simplypedia: Accessible reading of Wikipedia with supportive AAC symbols

The background

Studies have shown that up to 25% of the population sometimes has difficulties reading normal informational text, and that 7–8% of the population has such severe reading difficulties that easy-to-read material is their only possibility to read texts. (http://www.lattlast.se/om-oss)

The group of people with reading difficulties is very diverse: they can be dyslexic, deaf, elderly, immigrants or school children. They can also have cognitive disabilities, such as autism, aphasia, dementia, or intellectual disabilities. Research has shown that especially this latter group can be helped by supportive AAC symbols, such as Blissymbolics (http://www.blissymbolics.org).

There is an easy-to-read variant of the English Wikipedia, called the "Simple English Wikipedia" (http://simple.wikipedia.org). The articles are not automatically simplified from the standard Wikipedia, but they are manually crowd-sourced like all other Wikipedias.

The project

In this project you will enhance the Simple English Wikipedia with supportive AAC symbols. You will create a tool (either a server-based web service, a stand-alone program, or an auto-generated web site) which acts as a mirror of Wikipedia, but where the text is enhanced by AAC symbols.

The tool should read the text of a given Wikipedia article and process it to find out which symbols should be attached, and where. The text must be sentence-segmented, POS-tagged and parsed, to find the main verbs and other important words and grammatical relations. The tool can then look up each word, or a synonym or hypo-/hypernym, in a symbol dictionary. The resulting text should then be displayed to the user.
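
The lookup-and-display step might be sketched as follows. The tokenization is deliberately naive, and the `<ruby>` markup and the symbol dictionary format are just one possible rendering choice, not a prescribed design:

```python
def annotate(sentence, symbol_dict):
    """Wrap each word found in the symbol dictionary in an HTML <ruby>
    element pairing the word with its AAC symbol image file."""
    out = []
    for token in sentence.split():
        key = token.lower().strip(".,!?")   # crude normalization
        if key in symbol_dict:
            out.append('<ruby>%s<rt><img src="%s"/></rt></ruby>'
                       % (token, symbol_dict[key]))
        else:
            out.append(token)
    return " ".join(out)
```

In the real tool, the dictionary lookup would happen after POS tagging and parsing, so that only content words in the right reading get symbols.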

There are lots of extra things that can be added. One example is to use a Named Entity Extraction module to find names, and then try to match them with pictures or logos. Another example is to make use of the Wikipedia links to find a suitable picture that can be shown together with the link.

The qualifications

You will only use existing NLP components for tagging, parsing, etc., but there will be a lot of programming to make the components work well together. NLTK can probably be used a lot, but there will certainly be things that you have to solve outside of NLTK.

The project involves script programming and web programming, so you need to know your Python and a good deal of HTML. JavaScript is a plus but not necessary, depending on what kind of tool you want to build.

The future

If you work hard and make a good project, you stand a good chance of getting a paper published in the International Workshop on Speech and Language Processing for Assistive Technologies (http://www.slpat.org/slpat2013).

The supervisor

Peter Ljunglöf, Department of Computer Science and Engineering (Data- och informationsteknik).


Dialogue-based Search Solution

Masters Thesis proposal (in cooperation with Findwise & Talkamatic)


In an enterprise environment, a search system often crawls and indexes a large number of different data sources – databases, Content Management Systems, external web pages, file shares with different types of documents, etc. Each of the data sources or sub-sources may have a primary target group – e.g. sales, engineers, marketing, doctors, nurses, etc. – all depending on the type of organization.

The purpose of the (unified) search system is to serve as a platform (a single entry point) to satisfy the information needs of all the different groups in an organization. However, given that search queries are often short (~2.2 words) and ambiguous, and that users have different backgrounds, the system employs a number of techniques for filtering and drilling down into the search results. One such technique is facets, i.e. filtering based on data source, additional keywords, dates, time, etc.

On the other hand, there are at least two types of users (behaviours): those who know exactly what they are looking for and how to find it, and who use search instead of menu clicks; and those who do not know exactly what they are looking for, nor where the potential information may be found. We can consider these two groups as the two extremes on a much more fine-grained scale.

We would like to concentrate on the second group of users, who often engage in some sort of dialogue with the search system. Such users may interact with the system in several ways during a search session – they may rewrite and expand their original query, filter it by facets, and click on some documents, until they finally discover (or fail to discover) what they were looking for.

Dialogue Systems

Spoken dialogue systems are computer systems which use speech as their primary output and input channels. Dialogue systems are primarily used in situations where the visual and tactile channels are not available, for instance while driving, but also to replace human operators, for instance in call centers. Recently, spoken dialogue systems have become more widespread with the arrival of Apple's Siri and Google's Voice Actions, even outside the traditional areas of use. As speech and voice have the potential of transmitting large quantities of information very fast compared to traditional GUI interaction, this is a development which is likely to continue.

A spoken dialogue system typically consists of a dialogue manager, an Automatic Speech Recogniser, a Text-to-speech engine, modules for interpretation and generation of utterances and finally some kind of application logic.

Voice search is a term which has emerged in recent years. The user speaks a search query, and the system responds by returning a hit list, much like an ordinary Google search. If the hit list doesn't contain the desired hit (document, music file, web site, etc.), the user needs to do a new voice search with a modified utterance.

The idea of this project is to replace voice search by dialogue-based search, where the user and the system engage in a dialogue over the search results in order to refine the search query. 

Dialogue-based Search – case study

The task of the Masters thesis is to explore the possibilities of using dialogue systems/dialogue acts to satisfy the information needs of certain groups of users of a search system. The target group consists of several types of users:

  • Users who submit very broad and ambiguous search queries (e.g. “Greece”, “food”, “pm”)
  • Users who do not employ the tools provided by the Search system such as facets (e.g. queries such as “pm pdf”)
  • Users with exploratory queries (e.g. “Abba first album”)

Document format – details

Before documents are sent for indexing in the search system, they are augmented with metadata. The metadata allows us to do a number of things:

  • Advanced queries
  • Filtering
  • Sorting
  • Faceting
  • Ranking

The format of an indexed document could look like this:

  <field name="id">6H500F0</field>
  <field name="name">Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300</field>
  <field name="manufacturer">Maxtor Corp.</field>
  <field name="category">electronics</field>
  <field name="category">hard drive</field>
  <field name="features">SATA 3.0Gb/s, NCQ</field>
  <field name="features">8.5ms seek</field>
  <field name="features">16MB cache</field>
  <field name="price">350</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
  <field name="manufacturedate_dt">2006-02-13T15:26:37Z</field>

  <field name="id">1</field>
  <field name="title">London</field>
  <field name="body">London is the capital of the UK. London has 7.8 million inhabitants</field>
  <field name="places">London</field>
  <field name="date">2012-11-30</field>
  <field name="author">John Pear</field>
  <field name="author_email">john@pear.com</field>
  <field name="author_phone">+44 123 456 789</field>
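
Given documents carrying such metadata, facet counting and drill-down can be sketched in a few lines. The dictionary-based document representation below is an assumption for illustration; a real system would query the search engine's own facet API instead:

```python
from collections import Counter

def facet_counts(docs, field):
    """Count facet values for one metadata field over a result set.
    Fields may be multi-valued (lists), as in the examples above."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if not isinstance(values, list):
            values = [values]
        counts.update(values)
    return counts

def filter_by(docs, field, value):
    """Drill down: keep only documents whose field contains the value."""
    def has(doc):
        v = doc.get(field)
        return value in v if isinstance(v, list) else v == value
    return [d for d in docs if has(d)]
```

A dialogue-based front end would use `facet_counts` to decide which clarifying question to ask (e.g. the facet that best splits the current result set), and `filter_by` to apply the user's answer.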


Supervisors: Peter Ljunglöf (CS) together with Findwise AB and Talkamatic AB.

High-quality translation of web pages in GF


The idea is to build a grammar for some standardized kind of web page, for instance personal CVs, which permits high-quality translation. This project has the potential to be carried out with industrial partners, who might also help define and evaluate the results. There is publication potential, but also commercial potential, in this task.


Contact and possible supervisor: Aarne Ranta.

Creating a large-scale translation lexicon for GF


This work uses resources like WordNet and Wiktionary to create a translation dictionary usable for open-text translation. The size should be at least 10k lemmas. Similar work has recently been carried out for e.g. Hindi, Finnish, German, and Bulgarian. Like the previous task, this is an ambitious project with the potential of leading to a publication.
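
As an illustration of the generation side, a script could emit GF modules from a list of lemma pairs extracted from WordNet/Wiktionary. The module names, the restriction to nouns, and the use of the resource library's `mkN` smart paradigm are simplifying assumptions; a full lexicon would cover all categories and handle irregular forms explicitly:

```python
def gf_lexicon(pairs, abs_name="Dict"):
    """Emit GF abstract and concrete (English/Swedish) module source for a
    list of (identifier, english_lemma, swedish_lemma, category) entries."""
    abstract = ["abstract %s = Cat ** {" % abs_name, "fun"]
    eng = ["concrete %sEng of %s = CatEng ** open ParadigmsEng in {" % (abs_name, abs_name), "lin"]
    swe = ["concrete %sSwe of %s = CatSwe ** open ParadigmsSwe in {" % (abs_name, abs_name), "lin"]
    for ident, en, sv, cat in pairs:
        abstract.append("  %s : %s ;" % (ident, cat))
        eng.append('  %s = mkN "%s" ;' % (ident, en))
        swe.append('  %s = mkN "%s" ;' % (ident, sv))
    for mod in (abstract, eng, swe):
        mod.append("}")
    return "\n".join(abstract), "\n".join(eng), "\n".join(swe)
```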


Contact and possible supervisor: Aarne Ranta.

GF Resource Grammar for a yet unimplemented language


See http://www.grammaticalframework.org/lib/doc/status.html for the current status. This is an ambitious project and hard work, requiring good knowledge of GF and the target language. But the result will be a permanent contribution to language resources, and almost certainly publishable. For instance, the Japanese grammar implemented within this Masters programme was published at JapTAL.


Contact and possible supervisor: Aarne Ranta.

Free Robust Parsing


Open speech recognition

Talkamatic builds dialogue systems and is currently using a GF-based grammar tool for parsing and generation. A unified language description is compiled into a speech recognition grammar (for Nuance Vocon ASR, PocketSphinx and others), a parser and a generator.

The problem with this is that the parser can only handle the utterances which the ASR can recognize from the ASR grammar. The parser is thus not robust, and if an open dictation grammar is used (such as the Dragon dictation used in Apple's Siri), the parser is mostly useless.


Currently TDM (the Talkamatic Dialogue Manager) requires all concepts used in the dialogue to be known in advance. Hence, for a dialogue-controlled music player, all artists, songs, genres etc. need to be known and explicitly declared beforehand.

There are disadvantages to this approach. For example, it requires access to an extensive music database in order to build a dialogue interface for a music player.

Problem description

To simplify the building of dialogue interfaces for this kind of application, it would be useful to have a more robust parser, which can identify sequences of dialogue moves in arbitrary user input strings.



  Utterance                              Dialogue Moves
  "Play Like a Prayer with Madonna"      answer("Like a Prayer":song_title)
  "Play Sisters of Mercy"                answer("Sisters of Mercy":song_name)
  "Play Sisters of Mercy"                answer("Sisters of Mercy":artist_name)
  "I would like to listen to Jazz"

Several different methods can be used – named entity recognizers, regular expressions, databases, etc. – or combinations of these. A strong requirement is that the parser should be built automatically or semi-automatically from a small corpus or database. Computational efficiency is also desirable, but less important. The parser must have a Python interface and run on Linux.
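
One simple baseline combining two of these methods (a database plus regular expressions) is to match entity names from a small music database in the utterance, where each match yields an answer-move and an ambiguous name yields several. The database format and the move notation below are assumptions modelled on the examples above:

```python
import re

def parse_moves(utterance, db):
    """Extract answer-moves from free text by matching entity names from a
    small database, e.g. db = {"song_title": ["like a prayer"], ...}.
    Longer names are tried first; ambiguous names produce several moves."""
    text = utterance.lower()
    moves = []
    for sort, names in db.items():
        for name in sorted(names, key=len, reverse=True):
            if re.search(r"\b%s\b" % re.escape(name), text):
                moves.append('answer("%s":%s)' % (name, sort))
    return moves
```

Built this way, the parser is generated automatically from the database alone, which satisfies the requirement above; a stronger version would add a statistical named entity recognizer for names missing from the database.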


Supervisors: Peter Ljunglöf, Chalmers Data- och informationsteknik, or Staffan Larsson, FLoV, together with Talkamatic AB. Talkamatic is a university research spin-off company based in Göteborg.


A small compensation may be paid by Talkamatic AB when the thesis is completed.


Spelling games


Developing an algorithm (web service(s)) for the automatic generation of exercises for training spelling (primarily for Swedish)


The currently developed application Lärka, with its web services, is used for computer-assisted language learning. Lärka generates a number of exercises based on the corpora (and their annotation) available through Korp. Vocabulary knowledge covers a broad spectrum of word knowledge; spelling and recognition in speech are two aspects of it.

Problem description

The aims of this work are:

  1. to implement web service(s) for adaptive spelling exercise generation using a text-to-speech module for Swedish, where the target words/phrases are pronounced and the student has to type what he/she hears; if the user seems comfortable with the spellings, the target words get longer, get inflected, or are pronounced in phrases, sentences, etc.
  2. to analyze possible approaches to providing reasonable feedback
  3. to implement a user interface for the exercise to be used in Lärka
  4. to create a database storing all misspellings associated with each individual graphical word, for future analysis (for better feedback or a better adaptivity path)
  5. (potentially) to evaluate the results.
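
The adaptivity in step 1 could start from something as simple as a word-length band that shifts with the learner's level. The band boundaries and the single up/down rule below are placeholder assumptions for illustration; a real system would also consider inflection and phrase-level targets:

```python
import random

def next_target(words, level):
    """Pick a target word whose length fits the difficulty band for this
    level (the band shifts upward as the level grows); fall back to any
    word if the band is empty."""
    lo, hi = 3 + level, 5 + level
    pool = [w for w in words if lo <= len(w) <= hi] or words
    return random.choice(pool)

def update_level(level, correct, max_level=5):
    """Simple adaptivity: move up after a correct answer, down after a miss."""
    return min(level + 1, max_level) if correct else max(level - 1, 0)
```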

Recommended skills:

  • Python
  • jQuery


Supervisors:

  • Elena Volodina/Torbjörn Lager
  • potentially others from Språkbanken/FLOV