

Developing an adaptive diagnostic vocabulary/grammar test for Swedish (2016)



Implement an adaptive diagnostic test for vocabulary and/or grammar for Swedish, based on Second Language Acquisition (SLA) research and frequency statistics available from the COCTAILL corpus.


Lärka (www.spraakbanken.gu.se/larka), an application currently under development at Språkbanken, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp. Attempts are being made to align the generated exercises with the CEFR proficiency scales (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). The actual users, however, may not know their level when they start working with the exercise generator. It is therefore important (and user-friendly) to offer some sort of placement/diagnostic test to those who may need it.
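To make the adaptive component concrete, here is a minimal sketch of a frequency-band staircase procedure in Python. It is only an illustration under stated assumptions: the item bank, the level ordering and the ask callback are hypothetical, and a real test would use a proper item-response model rather than a fixed step size.

  import random

  # Hypothetical item bank: vocabulary items bucketed by CEFR level,
  # e.g. derived from COCTAILL frequency statistics.
  ITEM_BANK = {
      "A1": ["hus", "bok", "vatten"],
      "A2": ["granne", "resa", "tidning"],
      "B1": ["myndighet", "utveckling"],
      "B2": ["förutsättning", "avvägning"],
      "C1": ["tillämpning", "genomgripande"],
  }
  LEVELS = ["A1", "A2", "B1", "B2", "C1"]

  def run_staircase(ask, start="B1", max_items=20):
      """Simple up/down staircase: move up one level after a correct
      answer, down one level after an incorrect one. `ask(word)` must
      return True/False; the level where the procedure ends up
      oscillating is the placement estimate."""
      level = LEVELS.index(start)
      for _ in range(max_items):
          word = random.choice(ITEM_BANK[LEVELS[level]])
          if ask(word):
              level = min(level + 1, len(LEVELS) - 1)
          else:
              level = max(level - 1, 0)
      return LEVELS[level]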

Some examples of existing diagnostic tests for vocabulary are:

Problem description

The aims of this work are the following:

  • to study literature on diagnostic testing for the different language skills and competences that are relevant for the CEFR;
  • to find out about other "actors" dealing with CEFR-based tests for Swedish, especially for placement/diagnosis, and as a result to suggest a format for a placement test covering one or (preferably) a range of the language skills and competences mentioned in the CEFR;
  • to implement the suggested test(s) in the form of web services that can be embedded into the Lärka platform (and eventually to develop the corresponding user-interface module); here it would be interesting, for example, to explore formats where free answers can be provided and scored;
  • to evaluate/test with users (language learners, teachers, linguists, etc.).

Recommended skills:

  • Python
  • interest in Lexical Semantics


Supervisors:

  • Elena Volodina, Ildiko Pilan
  • potentially others from Språkbanken

Classification of learner essays by achieved proficiency level



Developing an algorithm (web services) for the automatic classification of Swedish learner essays by achieved proficiency level.


The suggested approach is to use machine learning for essay classification. The challenge is to identify features that are both informed by Second Language Acquisition (SLA) research and informative for the task at hand.

The classification will be made in terms of the proficiency levels of the Common European Framework of Reference (CEFR), which covers six learner levels: A1 (beginner), A2, B1, B2, C1 and C2 (near-native). At the moment we have electronic corpora of essays at levels B1, B2 and C1. Essays at A2 are hand-written and have not yet been digitized and annotated (which presumably can be done in time for the project, if someone picks this topic).
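As a concrete starting point, a baseline classifier along these lines could be sketched with scikit-learn as below. The n-gram features are only placeholders for the SLA-informed features the thesis would have to design; the function name is hypothetical.

  from sklearn.pipeline import Pipeline
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def train_level_classifier(essays, levels):
      """essays: list of essay texts; levels: CEFR labels ('B1', 'B2', 'C1')."""
      clf = Pipeline([
          # Surface n-gram features as a baseline; a real system would add
          # SLA-motivated features such as lexical frequency bands,
          # sentence length and morphosyntactic complexity.
          ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
          ("model", LogisticRegression(max_iter=1000)),
      ])
      print("5-fold accuracy:", cross_val_score(clf, essays, levels, cv=5).mean())
      return clf.fit(essays, levels)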

Problem description

The steps for this project would include:

  • background reading on the topics of SLA, the CEFR, essay grading and learner essay classification by levels; see one example for Swedish essay grading (NOT in terms of levels, but in terms of grades, i.e. (Väl/Icke) Godkänd): http://www.ling.su.se/english/nlp/tools/automated-essay-scoring
  • testing approaches for the best-performing classification
  • implementation of web service(s) for learner essay classification
  • (potentially) implementation of Lärka-based user interface where new essays can be tested
  • (potentially) evaluation of the results with teachers & new essays

Recommended skills:

  • Python
  • jQuery
  • interest in machine learning


Supervisors:

  • Elena Volodina/Ildiko Pilan
  • potentially others from Språkbanken/FLOV

Overcoming semantic challenges in selection of distractors for multiple-choice vocabulary exercises (2016)



Find a way to make sure that distractors in multiple-choice activities are genuine (i.e. cannot be used instead of the correct answer) in the context of a sentence/exercise item. This is primarily aimed at Swedish, but other languages are possible candidates as well.


Multiple-choice items for training vocabulary knowledge are a well-documented exercise format. However, when it comes to automatic generation of this exercise type, selecting genuinely appropriate distractors becomes a complicated problem. For example, if a learner wants to train vocabulary from the topical domain of “Medical services and SOS”, answer options from the same topical domain might be generated as follows:

Parents couldn't afford to buy the necessary _________.

Choices: pincers, medicine, tablets, blood, hospital, nurse (correct answer: medicine)

More than one alternative in the example above can be used to fill the gap (i.e. pincers, medicine, tablets). It is therefore important to be able to select distractors that cannot replace the correct answer, either semantically or collocationally, in the context of the sentence; in the case above, a better set of choices would be: medicine, blood, hospital, nurse, emergency room.
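One possible line of attack, sketched below, is to score how well each candidate fits the gap context under a distributional model and to discard candidates that fit almost as well as the key. The pre-trained embedding file is an assumption, and embedding similarity is only a rough proxy for "can fill the gap"; whether it actually separates fillable from non-fillable candidates is exactly what the thesis would have to evaluate.

  from gensim.models import KeyedVectors

  # Assumption: pre-trained Swedish word embeddings in word2vec format.
  vectors = KeyedVectors.load_word2vec_format("swedish_vectors.bin", binary=True)

  def genuine_distractors(context_words, correct, candidates, margin=0.8):
      """Keep only candidates whose fit to the sentence context is clearly
      worse than that of the correct answer; a candidate that fits almost
      as well (e.g. 'tablets' above) may be a second acceptable answer."""
      context = [w for w in context_words if w in vectors]
      key_fit = vectors.n_similarity(context, [correct])
      keep = []
      for cand in candidates:
          if cand == correct or cand not in vectors:
              continue
          if vectors.n_similarity(context, [cand]) < margin * key_fit:
              keep.append(cand)
      return keep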

Problem description

The aims of this work are thus:

  • to study the literature on the topic of distractor selection, lexical semantics and context modeling
  • to implement/test some approach(es) for a semantically aware selection of distractors
  • to embed the selection algorithm into Lärka as a web service
  • to evaluate/test with users (language learners, teachers, linguists, etc.)

Recommended skills:

  • Python
  • interest in Lexical Semantics


Supervisors:

  • Elena Volodina, Ildiko Pilan
  • potentially others from Språkbanken/FLOV

OCR error correction and segmentation

Goal: To create better-quality text by finding good methods for OCR error correction and text segmentation.

Background: Long-term textual archives, spanning decades or centuries, have the potential to answer many interesting research questions. One can look for changes in language and culture, author influence, and writing standards, among other things. One difficulty in working with historical documents is the quality of the data. Many long-term archives contain scanned documents from times when OCR technology was far from perfect. In addition, the physical quality of the documents themselves leaves much to be desired. Because redoing the OCR or correcting it manually would take a vast amount of time, the errors should instead be corrected automatically, using rule-based systems and machine learning techniques. Beyond the problems with OCR, many co-occurrence statistics require a definition of sentence or paragraph boundaries, which in itself can be very challenging for texts from certain periods.
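To illustrate the statistical side, here is a minimal Norvig-style sketch: generate candidate corrections within one edit of the OCR token and rank them by their frequency in a reference corpus. The corpus file is an assumption, and a real system would weight edits by OCR confusion probabilities (e.g. 'rn' misread as 'm') instead of treating all edits as equally likely.

  import re
  from collections import Counter

  # Assumption: a large, reasonably clean reference corpus of Swedish text.
  with open("reference_corpus.txt", encoding="utf-8") as f:
      FREQ = Counter(re.findall(r"\w+", f.read().lower()))

  ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

  def edits1(word):
      """All strings one delete, replace or insert away from `word`."""
      splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
      deletes = [a + b[1:] for a, b in splits if b]
      replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
      inserts = [a + c + b for a, b in splits for c in ALPHABET]
      return set(deletes + replaces + inserts)

  def correct(word):
      """Return the most frequent in-vocabulary candidate, or the word
      itself if no candidate is found."""
      if word in FREQ:
          return word
      candidates = [w for w in edits1(word) if w in FREQ]
      return max(candidates, key=FREQ.get) if candidates else word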

Are you interested in working with a large digitized textual archive and developing techniques for correcting OCR errors in documents over 200 years old? Do you want to find ways to define sentences and paragraphs? Do you want to be part of a research team working on an exciting new research topic?

Problem description: Applying rule-based and statistical machine learning techniques to improve the quality of a large newspaper archive. The improvements will later be used by Språkbanken and made available to the archive's users.

Recommended skills: an interest in rule-based systems and statistical machine learning techniques; some mathematical background; programming skills. We are looking for a highly motivated 5th-year student.

Supervisors: If you are interested or have questions, please contact: Nina Tahmasebi, phone: 031-786 6953, email: nina.tahmasebi@gu.se

Text categorization by topics (2016)



Testing/comparing approaches to text categorization/topic modeling based on coursebook texts labeled for topics.


A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The main purpose of testing approaches to topic modeling in this project is to identify the best-performing approach, which can eventually be used to select texts for learners by their topic of preference. These models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.

Recently, we have compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, where each text is labeled with a topic (or a set of topics). This corpus will form the training/testing data for topic modeling experiments.
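An obvious first experiment is an LDA baseline over the coursebook texts; a minimal sketch with scikit-learn follows. The number of topics mirrors the 28 COCTAILL labels, but since the COCTAILL texts carry gold topic labels, the unsupervised topics can also be compared against them (or a supervised classifier trained directly).

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation

  def fit_lda(texts, n_topics=28, n_top_words=10):
      """Fit an LDA topic model and print the top words per topic;
      `texts` would be the COCTAILL coursebook texts."""
      vec = CountVectorizer(max_df=0.9, min_df=2)
      counts = vec.fit_transform(texts)
      lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
      doc_topics = lda.fit_transform(counts)
      words = vec.get_feature_names_out()
      for i, topic in enumerate(lda.components_):
          top = [words[j] for j in topic.argsort()[-n_top_words:][::-1]]
          print(f"topic {i}: {' '.join(top)}")
      return lda, doc_topics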

Problem description

The aims of this work include the following:

  • to study literature on topic modeling
  • to test/compare several of the suggested approaches to text categorization/topic modeling for (some of?) the topics present in the COCTAILL corpus (a total of 28 topics used across 5 proficiency levels)
  • to apply the developed algorithms to some real-life texts (e.g. from Korp or from the web) to assess their performance

Recommended skills:

  • Python, (maybe R)


Supervisors:

  • Richard Johansson/Elena Volodina
  • potentially others from Språkbanken

Part-of-speech tagging/syntactic parsing of emergent texts


The goal of this project is to implement a part-of-speech tagger, and to investigate the possibilities of developing a syntactic parser, that can handle emergent text, i.e. texts (or representations of texts) that are still being produced, and thus frequently changed, in order to identify the syntactic location of, for example, pauses.


In research on language production, pauses and revisions are generally viewed as a window onto the underlying cognitive and linguistic processes. In research on written language production, these processes are captured by means of keystroke-logging programs that record all keystrokes and mouse movements and their temporal distribution. These programs generate vast amounts of data, which are time-consuming to analyse manually. A part-of-speech tagger that could handle emergent text would therefore be of utmost importance for quantitative analyses of large language production corpora. Naturally, a syntactic parser would add even more value.

Problem description

  • To develop an HMM tagger for emergent texts (primarily in Swedish, but English texts could also be made available); a minimal sketch of the tagging part follows below
  • To investigate the possibilities of implementing a discourse-based incremental parser for emergent texts and, if possible, to implement it
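A quick way to prototype the tagging part is NLTK's HMM trainer, assuming a corpus of tagged text snapshots is available; the training data and the example sentence below are hypothetical. The hard part the thesis must address is that emergent text contains incomplete words and mid-revision states that ordinary training corpora do not reflect.

  from nltk.tag import hmm

  def train_hmm_tagger(tagged_sents):
      """tagged_sents: list of [(word, tag), ...] sentences, e.g. tagged
      snapshots of emergent text extracted from a keystroke log."""
      trainer = hmm.HiddenMarkovModelTrainer()
      return trainer.train_supervised(tagged_sents)

  # Hypothetical usage: tag an unfinished sentence as it stands mid-revision.
  # tagger = train_hmm_tagger(train_data)
  # print(tagger.tag("hon skriver en".split()))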

Recommended skills:

Good programming skills.


Supervisors: Richard Johansson and Åsa Wengelin

Simplypedia: Accessible reading of Wikipedia with supportive AAC symbols

The background

Studies have shown that up to 25% of the population sometimes have difficulties reading normal informational text, and that 7–8% have reading difficulties so severe that easy-to-read material is their only way to read texts (http://www.lattlast.se/om-oss).

The group of people with reading difficulties is very diverse: they can be dyslexic, deaf, elderly, immigrants or school children. They can also have cognitive disabilities, such as autism, aphasia, dementia, or intellectual disabilities. Research has shown that especially this latter group can be helped by supportive AAC symbols, such as Blissymbolics (http://www.blissymbolics.org).

There is an easy-to-read variant of the English Wikipedia, called the "Simple English Wikipedia" (http://simple.wikipedia.org). Its articles are not automatically simplified versions of the standard Wikipedia articles; they are manually crowd-sourced, like all other Wikipedias.

The project

In this project you will enhance the Simple English Wikipedia with supportive AAC symbols. You will create a tool (either a server-based web service, a stand-alone program, or an auto-generated web site) which acts as a mirror of Wikipedia, but where the text is enhanced with AAC symbols.

The tool should read the text of a given Wikipedia article and process it to find out which symbols should be attached, and where. The text must be sentence-segmented, POS-tagged and parsed, to find the main verbs and other important words and grammatical relations. The tool can then look up each word, or a synonym or a hypo-/hypernym, in a symbol dictionary. The resulting text should then be displayed to the user.
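A minimal NLTK sketch of that pipeline (without the parsing step) might look as follows; the symbol dictionary is a placeholder for a real mapping to e.g. Blissymbolics symbols:

  import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data

  # Placeholder symbol dictionary: word -> symbol identifier or image URL.
  SYMBOLS = {"london": "bliss:place", "capital": "bliss:capital_city"}

  def attach_symbols(text):
      """Sentence-split, tokenize and POS-tag the article text, then look
      up content words (nouns, verbs, adjectives) in the symbol dictionary."""
      enhanced = []
      for sent in nltk.sent_tokenize(text):
          for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
              content = tag.startswith(("NN", "VB", "JJ"))
              symbol = SYMBOLS.get(word.lower()) if content else None
              enhanced.append((word, symbol))  # None = show word without symbol
      return enhanced

When a word itself is not in the dictionary, the synonym/hypernym fallback mentioned above could query WordNet before giving up.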

There are lots of extra things that can be added. One example is to use a Named Entity Extraction module to find names, and then try to match them with pictures or logos. Another example is to make use of the Wikipedia links to find a suitable picture that can be shown together with the link.

The qualifications

You will only use existing NLP components for tagging, parsing etc. But there will be a lot of programming to make the components work well together. NLTK can probably be used a lot, but there will certainly be things that you have to solve outside of NLTK.

The project involves script programming and web programming, so you need to know your Python and a good deal of HTML. Javascript is a plus but not necessary, depending on what kind of tool you want to build.

The future

If you work hard and make a good project, you stand a good chance of getting a paper published at the International Workshop on Speech and Language Processing for Assistive Technologies (http://www.slpat.org/slpat2013).

The supervisor

Peter Ljunglöf, Department of Computer Science and Engineering (Data- och informationsteknik).


Dialogue-based Search Solution

Master's thesis proposal (in cooperation with Findwise & Talkamatic)


In an enterprise environment, a search system often crawls and indexes a large number of different data sources: databases, Content Management Systems, external web pages, file shares with different types of documents, etc. Each of the data sources or sub-sources may have a primary target group (e.g. Sales, Engineers, Marketing, Doctors, Nurses), depending on the type of organization.

The purpose of the (unified) search system is to serve as a platform (a single entry point) that satisfies the information needs of all the different groups in an organization. However, given that search queries are often short (~2.2 words) and ambiguous, and that users have different backgrounds, the system employs a number of techniques for filtering and drilling down into the search results. One such technique is facets, i.e. filtering based on data source, additional keywords, dates, time, etc.

On the other hand, there are at least two types of user behaviour: users who know exactly what they are looking for and how to find it, and who use search instead of menu clicks; and users who do not know exactly what they are looking for, nor where the potential information may be found. We can consider these two groups as the two extremes of a much more fine-grained scale.

We would like to concentrate on the second group of users, who often engage in some sort of dialogue with the search system. Such users may interact with the system in several ways during a search session: they may rewrite and expand their original query, filter it by facets, and click on some documents, until they finally discover (or do not discover) what they were looking for.

Dialogue Systems

Spoken dialogue systems are computer systems which use speech as their primary input and output channels. Dialogue systems are primarily used in situations where the visual and tactile channels are not available, for instance while driving, but also to replace human operators, for instance in call centers. Recently, spoken dialogue systems have become more widespread with the arrival of Apple’s Siri and Google’s Voice Actions, even outside the traditional areas of use. As speech and voice have the potential of transmitting large quantities of information very fast compared to traditional GUI interaction, this is a development which is likely to continue.

A spoken dialogue system typically consists of a dialogue manager, an Automatic Speech Recogniser, a Text-to-speech engine, modules for interpretation and generation of utterances and finally some kind of application logic.

Voice search is a term which has emerged in recent years. The user speaks a search query, and the system responds by returning a hit list, much like an ordinary Google search. If the hit list doesn’t contain the desired hit (document, music file, web site, etc.), the user needs to do a new voice search with a modified utterance.

The idea of this project is to replace voice search by dialogue-based search, where the user and the system engage in a dialogue over the search results in order to refine the search query. 
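As a hypothetical illustration of the core loop, the sketch below inspects the facet distribution of the current hit list and asks a clarifying question instead of just returning results; the search backend and ask_user interface are assumptions, not an existing API.

  def dialogue_search(query, search, ask_user, max_turns=5):
      """search(query, filters) -> (hits, facets), where facets maps a
      field name to {value: count}; ask_user(question, options) -> the
      chosen option or None. The loop narrows the result set by turning
      facets into clarifying questions."""
      filters = {}
      hits = []
      for _ in range(max_turns):
          hits, facets = search(query, filters)
          if len(hits) <= 10 or not facets:
              break
          # Pick the facet field with the most distinct values; a crude
          # proxy for how well it splits the current results.
          field, values = max(facets.items(), key=lambda kv: len(kv[1]))
          options = sorted(values, key=values.get, reverse=True)[:5]
          choice = ask_user(f"Are you looking for '{query}' related to one of these?", options)
          if choice is None:
              break
          filters[field] = choice
      return hits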

Dialogue-based Search – case study

The task of the Master's thesis is to explore the possibilities of using dialogue systems/dialogue acts to satisfy the information needs of certain groups of users in a search system. The target group consists of several types of users:

  • Users who submit very broad and ambiguous search queries (e.g. “Greece”, “food”, “pm”)
  • Users who do not employ the tools provided by the Search system such as facets (e.g. queries such as “pm pdf”)
  • Users with exploratory queries (e.g. “Abba first album”)

Document format – details

Before documents are sent for indexing in the search system, they are augmented with metadata. The metadata allows us to do a number of things:

  • Advanced queries
  • Filtering
  • Sorting
  • Faceting
  • Ranking

The format of an indexed document could look like this:

  <field name="id">6H500F0</field>
  <field name="name">Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300</field>
  <field name="manufacturer">Maxtor Corp.</field>
  <field name="category">electronics</field>
  <field name="category">hard drive</field>
  <field name="features">SATA 3.0Gb/s, NCQ</field>
  <field name="features">8.5ms seek</field>
  <field name="features">16MB cache</field>
  <field name="price">350</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
  <field name="manufacturedate_dt">2006-02-13T15:26:37Z</field>

  <field name="id">1</field>
  <field name="title">London</field>
  <field name="body">London is the capital of the UK. London has 7.8 million inhabitants.</field>
  <field name="places">London</field>
  <field name="date">2012-11-30</field>
  <field name="author">John Pear</field>
  <field name="author_email">john@pear.com</field>
  <field name="author_phone">+44 123 456 789</field>
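
The field layout above resembles Solr's example documents. Assuming a Solr index (the endpoint URL and core name below are hypothetical), a faceted query over such documents could look like this:

  import requests

  # Hypothetical Solr endpoint; the field names match the examples above.
  SOLR = "http://localhost:8983/solr/archive/select"

  params = {
      "q": "hard drive",          # free-text query
      "fq": "inStock:true",       # filter query on a metadata field
      "facet": "true",
      "facet.field": "category",  # facet counts over the category field
      "sort": "price asc",
      "wt": "json",
  }
  response = requests.get(SOLR, params=params).json()
  # Solr returns facet counts as a flat [value, count, ...] list.
  print(response["facet_counts"]["facet_fields"]["category"])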


Supervisors: Peter Ljunglöf (CS), together with Findwise AB and Talkamatic AB.

High-quality translation of web pages in GF


The idea is to build a grammar for some standardized kind of web page, for instance personal CVs, which permits high-quality translation. This project has the potential to be carried out with industrial partners, who might also help define and evaluate the results. There is publication potential, but also commercial potential, in this task.


Contact and possible supervisor: Aarne Ranta.

Creating a large-scale translation lexicon for GF


This work uses resources like WordNet and Wiktionary to create a translation dictionary usable for open-text translation. The size should be at least 10k lemmas. Similar work has recently been carried out for e.g. Hindi, Finnish, German, and Bulgarian. Like the previous task, this is an ambitious project with the potential to lead to a publication.
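As an illustration of the extraction step, the sketch below pulls noun lemmas from WordNet via NLTK and emits GF-style lexicon lines. The use of the resource-library paradigm mkN follows common GF practice, but the exact output layout is an assumption; Wiktionary data would be handled analogously.

  from nltk.corpus import wordnet as wn  # requires the 'wordnet' data package

  def wordnet_nouns(limit=10000):
      """Collect unique single-word noun lemmas from WordNet."""
      lemmas, seen = [], set()
      for synset in wn.all_synsets(pos=wn.NOUN):
          for lemma in synset.lemmas():
              name = lemma.name()
              # Skip multi-word lemmas and names that would not be valid
              # GF identifiers.
              if name.isalpha() and name not in seen:
                  seen.add(name)
                  lemmas.append(name)
              if len(lemmas) >= limit:
                  return lemmas
      return lemmas

  def gf_entries(lemmas):
      """Yield (abstract, concrete) GF lines for each lemma; a translation
      lexicon would pair each entry with target-language linearizations."""
      for lemma in lemmas:
          yield (f"fun {lemma}_N : N ;", f'lin {lemma}_N = mkN "{lemma}" ;')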


Contact and possible supervisor: Aarne Ranta.